Bug 42102 - Moving BOM stripping from JSC specific code to V8+JSC shared code caused V8 crashes
Summary: Moving BOM stripping from JSC specific code to V8+JSC shared code caused V8 c...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore JavaScript (show other bugs)
Version: 528+ (Nightly build)
Hardware: Other All
: P2 Normal
Assignee: Tony Gentilcore
URL: http://www.1616.net/
Keywords:
Depends on: 42224
Blocks:
  Show dependency treegraph
 
Reported: 2010-07-12 12:58 PDT by Tony Gentilcore
Modified: 2010-07-13 18:19 PDT (History)
6 users (show)

See Also:


Attachments
Patch (2.10 KB, patch)
2010-07-12 14:59 PDT, Tony Gentilcore
no flags Details | Formatted Diff | Diff
Patch (1.70 KB, patch)
2010-07-12 16:41 PDT, Tony Gentilcore
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Tony Gentilcore 2010-07-12 12:58:36 PDT
It appears that https://bugs.webkit.org/show_bug.cgi?id=41539 moved BOM stripping from JSC specific code to CachedScript which is shared by V8 & JSC.

This caused crashes in V8: http://code.google.com/p/chromium/issues/detail?id=48475

I'm looking into a fix now. Oliver, please let me know if you have any suggestions or insight.
Comment 1 Tony Gentilcore 2010-07-12 14:59:17 PDT
Created attachment 61275 [details]
Patch
Comment 2 Oliver Hunt 2010-07-12 15:26:27 PDT
Comment on attachment 61275 [details]
Patch

How is v8 getting the text without boms removed?  this patch basically makes it so that decoded script data has the BOMs removed.  If V8 is getting scripts from webcore without decoding occuring then v8 is breaking itself.

That said the guard should be USE(JSC) not "WTF_USE_JSC"
Comment 3 Tony Gentilcore 2010-07-12 16:17:45 PDT
(In reply to comment #2)
> (From update of attachment 61275 [details])
> How is v8 getting the text without boms removed?

Ah yes, that is the key question. It gets it via bindings/v8/ScriptSourceCode.h. That seems to be the equivalent of JSC's StringSourceProvider to which you added the copyStringWithoutBOMs() call to the ctor. So an alternative and cleaner fix is to add the same code to that ctor that you added to StringSourceProvider's.

>  this patch basically makes it so that decoded script data has the BOMs removed.  If V8 is getting scripts from webcore without decoding occuring then v8 is breaking itself.

Here's where I'm not understanding something. Hopefully you can fill me in. V8 skips over BOMs, treating them as whitespace. This method copies the string, stripping the BOMs. But it doesn't appear to actually be doing anything with the BOM. So why pay the expense of the string copy to strip instead of just skipping them as whitespace during parsing.

> 
> That said the guard should be USE(JSC) not "WTF_USE_JSC"

Noted.
Comment 4 Oliver Hunt 2010-07-12 16:23:42 PDT
(In reply to comment #3)
> (In reply to comment #2)
> > (From update of attachment 61275 [details] [details])
> > How is v8 getting the text without boms removed?
> 
> Ah yes, that is the key question. It gets it via bindings/v8/ScriptSourceCode.h. That seems to be the equivalent of JSC's StringSourceProvider to which you added the copyStringWithoutBOMs() call to the ctor. So an alternative and cleaner fix is to add the same code to that ctor that you added to StringSourceProvider's.
> 
> >  this patch basically makes it so that decoded script data has the BOMs removed.  If V8 is getting scripts from webcore without decoding occuring then v8 is breaking itself.
> 
> Here's where I'm not understanding something. Hopefully you can fill me in. V8 skips over BOMs, treating them as whitespace. This method copies the string, stripping the BOMs. But it doesn't appear to actually be doing anything with the BOM. So why pay the expense of the string copy to strip instead of just skipping them as whitespace during parsing.

We have found historically that there is a substantial cost to allowing BOMs to be scattered throughout the source -- the lexer needs to do a lot more work for any symbol that is more than a single character long.
Comment 5 Tony Gentilcore 2010-07-12 16:41:08 PDT
Created attachment 61289 [details]
Patch
Comment 6 Oliver Hunt 2010-07-12 16:43:29 PDT
Comment on attachment 61289 [details]
Patch

As long as the V8 lexer handles boms itself this is unlikely to be a win, you should test impact after removing bom handling from the v8 lexer.
Comment 7 Tony Gentilcore 2010-07-12 17:10:55 PDT
(In reply to comment #6)
> (From update of attachment 61289 [details])
> As long as the V8 lexer handles boms itself this is unlikely to be a win, you should test impact after removing bom handling from the v8 lexer.

@ager, I'm landing this as a correctness fix. WebKit now guarantees not to pass BOMs to V8, so do you know if it is possible to remove the BOM check from scanner.cc? If so, this might actually be a win for V8. If not, we probably want to look into a clean way to avoid this extra copy for V8 without a #if guard in CachedScript.
Comment 8 WebKit Commit Bot 2010-07-12 19:57:59 PDT
Comment on attachment 61289 [details]
Patch

Clearing flags on attachment: 61289

Committed r63162: <http://trac.webkit.org/changeset/63162>
Comment 9 WebKit Commit Bot 2010-07-12 19:58:04 PDT
All reviewed patches have been landed.  Closing bug.
Comment 10 Mads Ager 2010-07-13 02:13:39 PDT
The BOM handling in the V8 scanner doesn't cost. I have removed it and I see no performance impact.

Also, I don't think the BOM handling should be removed from the scanner. I see a couple of issues:

1. Stripping BOM chars outside of the scanner means that all BOM chars are stripped no matter where they occur. This is not done in any of the other browsers. In particular BOM chars are not stripped from string literals. BOM chars are only treated as whitespace between tokens. This has been treated as a security issue in Firefox: 

http://www.mozilla.org/security/announce/2008/mfsa2008-43.html
https://bugzilla.mozilla.org/show_bug.cgi?id=430740

V8 is following the other engines here.

2. There is no handling of BOM chars in strings generated in JavaScript code and passed to eval if the handling is removed from the scanner. Therefore the handling of BOM in script blocks and in strings generated in JS is different.

3. The stripping currently done in WebKit only handles 0xFEFF. The other engines handle 0xFFFE as well.

For the above reasons i don't think BOM chars should be stripped early like this. It should be handled by the scanner and only between tokens. Oliver, what do you think?
Comment 11 Zoltan Herczeg 2010-07-13 03:40:28 PDT
(In reply to comment #10)
Wow, this is interesting. I have tried moz regression test:

https://bug430740.bugzilla.mozilla.org/attachment.cgi?id=318439

and it was passed. After a little thinking, I changed the \ufffe to \ufeff, and got the following output:

861480
STATUS: Do not strip format-control characters from string literals
evildone
FAILED!: [reported from test()] Do not strip format-control characters from string literals
FAILED!: [reported from test()] Expected value 'a%EF%BF%BE,+doevil()%5D)//', Actual value 'a',%20evildone'
FAILED!: [reported from test()]

Evil things happen here.

The Unicode spec says (in chapter 2.13):

To have an efficient way to indicate which byte order is used in a text, the Unicode Standard contains two code points, U+FEFF zero width no-break space (byte order mark) and U+FFFE (a noncharacter), which are the byte-ordered mirror images of each other. When a BOM is received with the opposite byte order, it will be recognized as a noncharacter and can therefore be used to detect the intended byte order of the text. The BOM is not a control character that selects the byte order of the text; rather, its function is to allow recipients to determine which byte ordering is used in a file.

"zero width no-break space" : I cannot find the definition of "no-break space", but it implies it is NOT a regular white space. AB\ufeffCD is simply an ABCD identifier, or 12\ufeff34 is number 1234. Looks like neither V8 nor SS do the right thing.
Comment 12 Oliver Hunt 2010-07-13 12:53:42 PDT
It looks like there's a change in behaviour from ES3 to ES5 -- ES3 requires BOMs to be stripped, ES5 says we should treat BOMs as whitespace.  I'm just harassing brendan to find out what the correct behaviour is meant to be.