RESOLVED FIXED Bug 42102
Moving BOM stripping from JSC specific code to V8+JSC shared code caused V8 crashes
https://bugs.webkit.org/show_bug.cgi?id=42102
Summary Moving BOM stripping from JSC specific code to V8+JSC shared code caused V8 c...
Tony Gentilcore
Reported 2010-07-12 12:58:36 PDT
It appears that https://bugs.webkit.org/show_bug.cgi?id=41539 moved BOM stripping from JSC specific code to CachedScript which is shared by V8 & JSC. This caused crashes in V8: http://code.google.com/p/chromium/issues/detail?id=48475 I'm looking into a fix now. Oliver, please let me know if you have any suggestions or insight.
Attachments
Patch (2.10 KB, patch)
2010-07-12 14:59 PDT, Tony Gentilcore
no flags
Patch (1.70 KB, patch)
2010-07-12 16:41 PDT, Tony Gentilcore
no flags
Tony Gentilcore
Comment 1 2010-07-12 14:59:17 PDT
Oliver Hunt
Comment 2 2010-07-12 15:26:27 PDT
Comment on attachment 61275 [details] Patch How is v8 getting the text without boms removed? this patch basically makes it so that decoded script data has the BOMs removed. If V8 is getting scripts from webcore without decoding occuring then v8 is breaking itself. That said the guard should be USE(JSC) not "WTF_USE_JSC"
Tony Gentilcore
Comment 3 2010-07-12 16:17:45 PDT
(In reply to comment #2) > (From update of attachment 61275 [details]) > How is v8 getting the text without boms removed? Ah yes, that is the key question. It gets it via bindings/v8/ScriptSourceCode.h. That seems to be the equivalent of JSC's StringSourceProvider to which you added the copyStringWithoutBOMs() call to the ctor. So an alternative and cleaner fix is to add the same code to that ctor that you added to StringSourceProvider's. > this patch basically makes it so that decoded script data has the BOMs removed. If V8 is getting scripts from webcore without decoding occuring then v8 is breaking itself. Here's where I'm not understanding something. Hopefully you can fill me in. V8 skips over BOMs, treating them as whitespace. This method copies the string, stripping the BOMs. But it doesn't appear to actually be doing anything with the BOM. So why pay the expense of the string copy to strip instead of just skipping them as whitespace during parsing. > > That said the guard should be USE(JSC) not "WTF_USE_JSC" Noted.
Oliver Hunt
Comment 4 2010-07-12 16:23:42 PDT
(In reply to comment #3) > (In reply to comment #2) > > (From update of attachment 61275 [details] [details]) > > How is v8 getting the text without boms removed? > > Ah yes, that is the key question. It gets it via bindings/v8/ScriptSourceCode.h. That seems to be the equivalent of JSC's StringSourceProvider to which you added the copyStringWithoutBOMs() call to the ctor. So an alternative and cleaner fix is to add the same code to that ctor that you added to StringSourceProvider's. > > > this patch basically makes it so that decoded script data has the BOMs removed. If V8 is getting scripts from webcore without decoding occuring then v8 is breaking itself. > > Here's where I'm not understanding something. Hopefully you can fill me in. V8 skips over BOMs, treating them as whitespace. This method copies the string, stripping the BOMs. But it doesn't appear to actually be doing anything with the BOM. So why pay the expense of the string copy to strip instead of just skipping them as whitespace during parsing. We have found historically that there is a substantial cost to allowing BOMs to be scattered throughout the source -- the lexer needs to do a lot more work for any symbol that is more than a single character long.
Tony Gentilcore
Comment 5 2010-07-12 16:41:08 PDT
Oliver Hunt
Comment 6 2010-07-12 16:43:29 PDT
Comment on attachment 61289 [details] Patch As long as the V8 lexer handles boms itself this is unlikely to be a win, you should test impact after removing bom handling from the v8 lexer.
Tony Gentilcore
Comment 7 2010-07-12 17:10:55 PDT
(In reply to comment #6) > (From update of attachment 61289 [details]) > As long as the V8 lexer handles boms itself this is unlikely to be a win, you should test impact after removing bom handling from the v8 lexer. @ager, I'm landing this as a correctness fix. WebKit now guarantees not to pass BOMs to V8, so do you know if it is possible to remove the BOM check from scanner.cc? If so, this might actually be a win for V8. If not, we probably want to look into a clean way to avoid this extra copy for V8 without a #if guard in CachedScript.
WebKit Commit Bot
Comment 8 2010-07-12 19:57:59 PDT
Comment on attachment 61289 [details] Patch Clearing flags on attachment: 61289 Committed r63162: <http://trac.webkit.org/changeset/63162>
WebKit Commit Bot
Comment 9 2010-07-12 19:58:04 PDT
All reviewed patches have been landed. Closing bug.
Mads Ager
Comment 10 2010-07-13 02:13:39 PDT
The BOM handling in the V8 scanner doesn't cost. I have removed it and I see no performance impact. Also, I don't think the BOM handling should be removed from the scanner. I see a couple of issues: 1. Stripping BOM chars outside of the scanner means that all BOM chars are stripped no matter where they occur. This is not done in any of the other browsers. In particular BOM chars are not stripped from string literals. BOM chars are only treated as whitespace between tokens. This has been treated as a security issue in Firefox: http://www.mozilla.org/security/announce/2008/mfsa2008-43.html https://bugzilla.mozilla.org/show_bug.cgi?id=430740 V8 is following the other engines here. 2. There is no handling of BOM chars in strings generated in JavaScript code and passed to eval if the handling is removed from the scanner. Therefore the handling of BOM in script blocks and in strings generated in JS is different. 3. The stripping currently done in WebKit only handles 0xFEFF. The other engines handle 0xFFFE as well. For the above reasons i don't think BOM chars should be stripped early like this. It should be handled by the scanner and only between tokens. Oliver, what do you think?
Zoltan Herczeg
Comment 11 2010-07-13 03:40:28 PDT
(In reply to comment #10) Wow, this is interesting. I have tried moz regression test: https://bug430740.bugzilla.mozilla.org/attachment.cgi?id=318439 and it was passed. After a little thinking, I changed the \ufffe to \ufeff, and got the following output: 861480 STATUS: Do not strip format-control characters from string literals evildone FAILED!: [reported from test()] Do not strip format-control characters from string literals FAILED!: [reported from test()] Expected value 'a%EF%BF%BE,+doevil()%5D)//', Actual value 'a',%20evildone' FAILED!: [reported from test()] Evil things happen here. The Unicode spec says (in chapter 2.13): To have an efficient way to indicate which byte order is used in a text, the Unicode Standard contains two code points, U+FEFF zero width no-break space (byte order mark) and U+FFFE (a noncharacter), which are the byte-ordered mirror images of each other. When a BOM is received with the opposite byte order, it will be recognized as a noncharacter and can therefore be used to detect the intended byte order of the text. The BOM is not a control character that selects the byte order of the text; rather, its function is to allow recipients to determine which byte ordering is used in a file. "zero width no-break space" : I cannot find the definition of "no-break space", but it implies it is NOT a regular white space. AB\ufeffCD is simply an ABCD identifier, or 12\ufeff34 is number 1234. Looks like neither V8 nor SS do the right thing.
Oliver Hunt
Comment 12 2010-07-13 12:53:42 PDT
It looks like there's a change in behaviour from ES3 to ES5 -- ES3 requires BOMs to be stripped, ES5 says we should treat BOMs as whitespace. I'm just harassing brendan to find out what the correct behaviour is meant to be.
Note You need to log in before you can comment on or make changes to this bug.