Bug 41539 - Move BOM handling out of the lexer and parser
Summary: Move BOM handling out of the lexer and parser
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore JavaScript (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Oliver Hunt
URL:
Keywords:
Depends on: 41549
Blocks:
  Show dependency treegraph
 
Reported: 2010-07-02 14:08 PDT by Oliver Hunt
Modified: 2010-07-06 10:24 PDT (History)
7 users (show)

See Also:


Attachments
Patch (12.18 KB, patch)
2010-07-02 14:14 PDT, Oliver Hunt
ggaren: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Oliver Hunt 2010-07-02 14:08:24 PDT
Move BOM handling out of the lexer and parser
Comment 1 Oliver Hunt 2010-07-02 14:14:40 PDT
Created attachment 60402 [details]
Patch
Comment 2 Darin Adler 2010-07-02 14:52:34 PDT
Comment on attachment 60402 [details]
Patch

> +                bool scratch = false;
> +                m_source = UString(m_source.rep()->copyStringWithoutBOMs(false, scratch));

It's a little awkward to have this on StringImpl and have to dig down to the rep(). Normally we call functions directly on UString or String.

> +            // ECMA-262 calls for stripping all Cf characters, but we only strip BOM characters.
> +            // See <https://bugs.webkit.org/show_bug.cgi?id=4931> for details.

This comment seems to belong somewhere else. Obviously in a function named copyStringWithoutBOMs we strip only BOMs. I think the comment should ideally go at some call site that is making the policy decision. Sadly, there are multiple places!

I think the interface, where you pass in a boolean to tell it whether to skip the check for BOMs, is really awkward. Can we just break this into multiple functions? I could see (a) one function that just answers the question of whether there is a BOM, (b) another that strips the BOMs, and (c) a third for the case where you want to strip in the unlikely case that a BOM exists. Then we wouldn't need boolean arguments out or in at all.

CachedScript would use the (a) and (b) functions and the other call sites would use the (c) function that combines (a) and (b).

The constant byteOrderMark ought to go into CharacterNames.h. I guess that would need to move to WTF if this code is in WTF, though.

> +            const size_t len = m_length;

We normally don’t use const in cases like this.

I really don't like the booleans arguments in copyStringWithoutBOMs, so I'm going to say review-, but you could convince me to change my mind.
Comment 3 Geoffrey Garen 2010-07-02 14:54:55 PDT
Comment on attachment 60402 [details]
Patch

I think copyStringWithoutBOMs is unnecessarily inefficient in the case where it finds a BOM.

I'd have the first loop remember the index of the BOM it found, and make that the starting point of the second loop. I'd also remove the shrinkToFit call (since it does an extra copy).
Comment 4 Darin Adler 2010-07-02 14:56:08 PDT
(In reply to comment #3)
> (From update of attachment 60402 [details])
> I think copyStringWithoutBOMs is unnecessarily inefficient in the case where it finds a BOM.
> 
> I'd have the first loop remember the index of the BOM it found, and make that the starting point of the second loop. I'd also remove the shrinkToFit call (since it does an extra copy).

Well, ggaren overrode my review- with a review+.

I agree on removing the shrinkToFit call. But I’m not sure I agree on optimizing the case where a BOM is partway through the string. Is this case really important enough to optimize?
Comment 5 Oliver Hunt 2010-07-02 15:31:55 PDT
Committed r62410: <http://trac.webkit.org/changeset/62410>
Comment 6 WebKit Review Bot 2010-07-02 15:57:12 PDT
http://trac.webkit.org/changeset/62410 might have broken Qt Linux Release
Comment 7 Geoffrey Garen 2010-07-02 16:11:50 PDT
Oops!

Didn't mean to stomp Darin's review. Not sure how that happened.
Comment 8 Zoltan Herczeg 2010-07-03 00:20:30 PDT
I am happy this code moved out of the lexer. Nice patch.
Comment 9 Alexey Proskuryakov 2010-07-06 10:15:46 PDT
Does this mean that clients using JSC API won't get this behavior any more? Is this a problem, and/or a spec violation?
Comment 10 Alexey Proskuryakov 2010-07-06 10:24:20 PDT
Oliver explained to me in person that WebCore CachedScript BOM handling doesn't replace JSC one.

>            // ECMA-262 calls for stripping all Cf characters, but we only strip BOM characters. 
>            // See <https://bugs.webkit.org/show_bug.cgi?id=4931> for details. 

This is clearly out of place in StringImpl.