41539 – Move BOM handling out of the lexer and parser

RESOLVED FIXED 41539

Move BOM handling out of the lexer and parser

https://bugs.webkit.org/show_bug.cgi?id=41539

Summary Move BOM handling out of the lexer and parser

Oliver Hunt

Reported 2010-07-02 14:08:24 PDT

Move BOM handling out of the lexer and parser

Attachments
Patch (12.18 KB, patch) 2010-07-02 14:14 PDT, Oliver Hunt	ggaren: review+	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.

Oliver Hunt

Comment 1 2010-07-02 14:14:40 PDT

Created attachment 60402 [details] Patch

Darin Adler

Comment 2 2010-07-02 14:52:34 PDT

Comment on attachment 60402 [details] Patch > + bool scratch = false; > + m_source = UString(m_source.rep()->copyStringWithoutBOMs(false, scratch)); It's a little awkward to have this on StringImpl and have to dig down to the rep(). Normally we call functions directly on UString or String. > + // ECMA-262 calls for stripping all Cf characters, but we only strip BOM characters. > + // See <https://bugs.webkit.org/show_bug.cgi?id=4931> for details. This comment seems to belong somewhere else. Obviously in a function named copyStringWithoutBOMs we strip only BOMs. I think the comment should ideally go at some call site that is making the policy decision. Sadly, there are multiple places! I think the interface, where you pass in a boolean to tell it whether to skip the check for BOMs, is really awkward. Can we just break this into multiple functions? I could see (a) one function that just answers the question of whether there is a BOM, (b) another that strips the BOMs, and (c) a third for the case where you want to strip in the unlikely case that a BOM exists. Then we wouldn't need boolean arguments out or in at all. CachedScript would use the (a) and (b) functions and the other call sites would use the (c) function that combines (a) and (b). The constant byteOrderMark ought to go into CharacterNames.h. I guess that would need to move to WTF if this code is in WTF, though. > + const size_t len = m_length; We normally don’t use const in cases like this. I really don't like the booleans arguments in copyStringWithoutBOMs, so I'm going to say review-, but you could convince me to change my mind.

Geoffrey Garen

Comment 3 2010-07-02 14:54:55 PDT

Comment on attachment 60402 [details] Patch I think copyStringWithoutBOMs is unnecessarily inefficient in the case where it finds a BOM. I'd have the first loop remember the index of the BOM it found, and make that the starting point of the second loop. I'd also remove the shrinkToFit call (since it does an extra copy).

Darin Adler

Comment 4 2010-07-02 14:56:08 PDT

(In reply to comment #3) > (From update of attachment 60402 [details]) > I think copyStringWithoutBOMs is unnecessarily inefficient in the case where it finds a BOM. > > I'd have the first loop remember the index of the BOM it found, and make that the starting point of the second loop. I'd also remove the shrinkToFit call (since it does an extra copy). Well, ggaren overrode my review- with a review+. I agree on removing the shrinkToFit call. But I’m not sure I agree on optimizing the case where a BOM is partway through the string. Is this case really important enough to optimize?

Oliver Hunt

Comment 5 2010-07-02 15:31:55 PDT

Committed r62410: <http://trac.webkit.org/changeset/62410>

WebKit Review Bot

Comment 6 2010-07-02 15:57:12 PDT

http://trac.webkit.org/changeset/62410 might have broken Qt Linux Release

Geoffrey Garen

Comment 7 2010-07-02 16:11:50 PDT

Oops! Didn't mean to stomp Darin's review. Not sure how that happened.

Zoltan Herczeg

Comment 8 2010-07-03 00:20:30 PDT

I am happy this code moved out of the lexer. Nice patch.

Alexey Proskuryakov

Comment 9 2010-07-06 10:15:46 PDT

Does this mean that clients using JSC API won't get this behavior any more? Is this a problem, and/or a spec violation?

Alexey Proskuryakov

Comment 10 2010-07-06 10:24:20 PDT

Oliver explained to me in person that WebCore CachedScript BOM handling doesn't replace JSC one. > // ECMA-262 calls for stripping all Cf characters, but we only strip BOM characters. > // See <https://bugs.webkit.org/show_bug.cgi?id=4931> for details. This is clearly out of place in StringImpl.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware All

OS All

Product WebKit

Component WebCore JavaScript

Assignee

Oliver Hunt

Reported

2010-07-02 14:08 PDT

Modified

2010-07-06 10:24 PDT History

CC List

7 users Show

URL

Keywords

Depends on

41549

Blocks

Dependencies

tree graph