RESOLVED FIXED Bug 41539
Move BOM handling out of the lexer and parser
https://bugs.webkit.org/show_bug.cgi?id=41539
Summary Move BOM handling out of the lexer and parser
Oliver Hunt
Reported 2010-07-02 14:08:24 PDT
Move BOM handling out of the lexer and parser
Attachments
Patch (12.18 KB, patch)
2010-07-02 14:14 PDT, Oliver Hunt
ggaren: review+
Oliver Hunt
Comment 1 2010-07-02 14:14:40 PDT
Darin Adler
Comment 2 2010-07-02 14:52:34 PDT
Comment on attachment 60402 [details] Patch > + bool scratch = false; > + m_source = UString(m_source.rep()->copyStringWithoutBOMs(false, scratch)); It's a little awkward to have this on StringImpl and have to dig down to the rep(). Normally we call functions directly on UString or String. > + // ECMA-262 calls for stripping all Cf characters, but we only strip BOM characters. > + // See <https://bugs.webkit.org/show_bug.cgi?id=4931> for details. This comment seems to belong somewhere else. Obviously in a function named copyStringWithoutBOMs we strip only BOMs. I think the comment should ideally go at some call site that is making the policy decision. Sadly, there are multiple places! I think the interface, where you pass in a boolean to tell it whether to skip the check for BOMs, is really awkward. Can we just break this into multiple functions? I could see (a) one function that just answers the question of whether there is a BOM, (b) another that strips the BOMs, and (c) a third for the case where you want to strip in the unlikely case that a BOM exists. Then we wouldn't need boolean arguments out or in at all. CachedScript would use the (a) and (b) functions and the other call sites would use the (c) function that combines (a) and (b). The constant byteOrderMark ought to go into CharacterNames.h. I guess that would need to move to WTF if this code is in WTF, though. > + const size_t len = m_length; We normally don’t use const in cases like this. I really don't like the booleans arguments in copyStringWithoutBOMs, so I'm going to say review-, but you could convince me to change my mind.
Geoffrey Garen
Comment 3 2010-07-02 14:54:55 PDT
Comment on attachment 60402 [details] Patch I think copyStringWithoutBOMs is unnecessarily inefficient in the case where it finds a BOM. I'd have the first loop remember the index of the BOM it found, and make that the starting point of the second loop. I'd also remove the shrinkToFit call (since it does an extra copy).
Darin Adler
Comment 4 2010-07-02 14:56:08 PDT
(In reply to comment #3) > (From update of attachment 60402 [details]) > I think copyStringWithoutBOMs is unnecessarily inefficient in the case where it finds a BOM. > > I'd have the first loop remember the index of the BOM it found, and make that the starting point of the second loop. I'd also remove the shrinkToFit call (since it does an extra copy). Well, ggaren overrode my review- with a review+. I agree on removing the shrinkToFit call. But I’m not sure I agree on optimizing the case where a BOM is partway through the string. Is this case really important enough to optimize?
Oliver Hunt
Comment 5 2010-07-02 15:31:55 PDT
WebKit Review Bot
Comment 6 2010-07-02 15:57:12 PDT
http://trac.webkit.org/changeset/62410 might have broken Qt Linux Release
Geoffrey Garen
Comment 7 2010-07-02 16:11:50 PDT
Oops! Didn't mean to stomp Darin's review. Not sure how that happened.
Zoltan Herczeg
Comment 8 2010-07-03 00:20:30 PDT
I am happy this code moved out of the lexer. Nice patch.
Alexey Proskuryakov
Comment 9 2010-07-06 10:15:46 PDT
Does this mean that clients using JSC API won't get this behavior any more? Is this a problem, and/or a spec violation?
Alexey Proskuryakov
Comment 10 2010-07-06 10:24:20 PDT
Oliver explained to me in person that WebCore CachedScript BOM handling doesn't replace JSC one. > // ECMA-262 calls for stripping all Cf characters, but we only strip BOM characters. > // See <https://bugs.webkit.org/show_bug.cgi?id=4931> for details. This is clearly out of place in StringImpl.
Note You need to log in before you can comment on or make changes to this bug.