Move BOM handling out of the lexer and parser
Created attachment 60402 [details] Patch
Comment on attachment 60402 [details] Patch > + bool scratch = false; > + m_source = UString(m_source.rep()->copyStringWithoutBOMs(false, scratch)); It's a little awkward to have this on StringImpl and have to dig down to the rep(). Normally we call functions directly on UString or String. > + // ECMA-262 calls for stripping all Cf characters, but we only strip BOM characters. > + // See <https://bugs.webkit.org/show_bug.cgi?id=4931> for details. This comment seems to belong somewhere else. Obviously in a function named copyStringWithoutBOMs we strip only BOMs. I think the comment should ideally go at some call site that is making the policy decision. Sadly, there are multiple places! I think the interface, where you pass in a boolean to tell it whether to skip the check for BOMs, is really awkward. Can we just break this into multiple functions? I could see (a) one function that just answers the question of whether there is a BOM, (b) another that strips the BOMs, and (c) a third for the case where you want to strip in the unlikely case that a BOM exists. Then we wouldn't need boolean arguments out or in at all. CachedScript would use the (a) and (b) functions and the other call sites would use the (c) function that combines (a) and (b). The constant byteOrderMark ought to go into CharacterNames.h. I guess that would need to move to WTF if this code is in WTF, though. > + const size_t len = m_length; We normally don’t use const in cases like this. I really don't like the booleans arguments in copyStringWithoutBOMs, so I'm going to say review-, but you could convince me to change my mind.
Comment on attachment 60402 [details] Patch I think copyStringWithoutBOMs is unnecessarily inefficient in the case where it finds a BOM. I'd have the first loop remember the index of the BOM it found, and make that the starting point of the second loop. I'd also remove the shrinkToFit call (since it does an extra copy).
(In reply to comment #3) > (From update of attachment 60402 [details]) > I think copyStringWithoutBOMs is unnecessarily inefficient in the case where it finds a BOM. > > I'd have the first loop remember the index of the BOM it found, and make that the starting point of the second loop. I'd also remove the shrinkToFit call (since it does an extra copy). Well, ggaren overrode my review- with a review+. I agree on removing the shrinkToFit call. But I’m not sure I agree on optimizing the case where a BOM is partway through the string. Is this case really important enough to optimize?
Committed r62410: <http://trac.webkit.org/changeset/62410>
http://trac.webkit.org/changeset/62410 might have broken Qt Linux Release
Oops! Didn't mean to stomp Darin's review. Not sure how that happened.
I am happy this code moved out of the lexer. Nice patch.
Does this mean that clients using JSC API won't get this behavior any more? Is this a problem, and/or a spec violation?
Oliver explained to me in person that WebCore CachedScript BOM handling doesn't replace JSC one. > // ECMA-262 calls for stripping all Cf characters, but we only strip BOM characters. > // See <https://bugs.webkit.org/show_bug.cgi?id=4931> for details. This is clearly out of place in StringImpl.