Summary: | Greek sigma is handled wrong in case independent regexp. | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Erik Corry <erikcorry> | ||||||
Component: | JavaScriptCore | Assignee: | Gavin Barraclough <barraclough> | ||||||
Status: | RESOLVED FIXED | ||||||||
Severity: | Normal | CC: | ap, barraclough, eae, msaboff, oliver, zherczeg | ||||||
Priority: | P2 | ||||||||
Version: | 528+ (Nightly build) | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
URL: | https://mail.mozilla.org/pipermail/es-discuss/2012-March/021631.html | ||||||||
Attachments: |
|
Description
Erik Corry
2012-03-23 10:18:17 PDT
The "u" flag appears to be proposed in <http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/>. But this bug should not depend on it, as far as I understand it. ς is lowercase variant of Σ when used in final position in a word. In other positions, it's σ. I'm not quite sure what the requested behavior is. Should /ΔΣΔ/i.test("δςδ") be true, as well? What about /ς/i.test("σ")? We had a long discussion about it in PCRE: http://bugs.exim.org/show_bug.cgi?id=1208 General caseless matching is just too overcomplicated. > /ΔΣΔ/i.test("δςδ") be true, as well? What about /ς/i.test("σ")
Both examples return true in V8, and I believe that is the correct behaviour.
Created attachment 133729 [details]
Fix
(In reply to comment #2) > We had a long discussion about it in PCRE: > http://bugs.exim.org/show_bug.cgi?id=1208 > > General caseless matching is just too overcomplicated. Bear in mind we don't need to come up with good rules here for how this ought to work (that's a task for ES6!), all we have to do is implement what's already specified in ES5.1 correctly. :-) Comment on attachment 133729 [details] Fix View in context: https://bugs.webkit.org/attachment.cgi?id=133729&action=review > LayoutTests/fast/regex/script-tests/unicodeCaseInsensitive.js:4 > +// However it looks like DRT is loading it's source as Latin1 rather than UTF8? - so escaping. Adding a UTF-8 BOM at the beginning of this file should solve that. Comment on attachment 133729 [details] Fix Nice patch! View in context: https://bugs.webkit.org/attachment.cgi?id=133729&action=review > Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:41 > + CanonicalizeRangeLo, // Value is positive delta to pair, E.g. 0x41 has value 0x20, -> 0x61. > + CanonicalizeRangeHi, // Value is positive delta to pair, E.g. 0x61 has value 0x20, -> 0x41. Just an idea: could these two be merged into one with positive value for low, and negative value for hi ranges? > Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:43 > + CanonicalizeAlternatingAligned, // Aligned consequtive pair, e.g. 0x1f4,0x1f5. > + CanonicalizeAlternatingUnaligned, // Unaligned consequtive pair, e.g. 0x241,0x242. Another idea for merge: see below > Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:94 > + return ch & 1 ? ch + 1 : ch - 1; This could be simplified as: ((ch - 1) ^ 1) + 1 If we merge the aligned and unaligned to one state: ((ch - info->begin) ^ 1) + info->begin This should cheap since info->begin is likely in some register. Or we could set info->value to 0 or 1 depending on the alignment. > Just an idea: could these two be merged into one with positive value for low, and negative value for hi ranges? They could, but I didn't for a few reasons. First, the table is nicely packed into 64-bit entries with a 16bit uint for the value; to really be able to represent this as a signed value, it would need to be 17bit. We could use a 16bit positive offset & ensure we wrap correctly, but this can be error prone. The extra state buys you the 17th bit. Also, sometimes we may want to know which value is greater, and we save branching to determine that this way. More importantly, I wanted a fast check for cases we can mask rather than comparing both values. For RangeLo values that is !(ch & value) && !(value & (value - 1)), for RangeHi (ch & value) && !(value & (value - 1)). We don't currently use this, because all non-ascii case-insensitive non-unique PatternCharracters are converted into CharacetrClasses, but this will enable us to handle many unicode ranges as fast as ascii in the future. > > Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:94 > > + return ch & 1 ? ch + 1 : ch - 1; > > This could be simplified as: ((ch - 1) ^ 1) + 1 I'd decided to stick to the simplest most readable form - premature optimization tends to lead to convoluted and less maintainable code - but you're right, seeing this written out, it's pretty clear - I think I'll go with this. > If we merge the aligned and unaligned to one state: ((ch - info->begin) ^ 1) + info->begin > This should cheap since info->begin is likely in some register. > Or we could set info->value to 0 or 1 depending on the alignment. This kind of bit-twiddling should really be reserved for places it actually makes something go measurably faster. :-) Bear in mind, these conversions should largely be being performed at compile time, and there higher level optimizations we're missing for unicode case-insensitive compared that are likely to have much higher impact (e.g., don't convert them all into CharacterClasses when building the AST!) cheers, G. Created attachment 133860 [details]
Fix
Fixed in r112143 This change caused fast/regex/unicodeCaseInsensitive.html to start failing, please see https://bugs.webkit.org/show_bug.cgi?id=82259 |