Bug 82063 - Greek sigma is handled wrong in case independent regexp.
Summary: Greek sigma is handled wrong in case independent regexp.
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Gavin Barraclough
URL: https://mail.mozilla.org/pipermail/es...
Keywords:
Depends on:
Blocks:
 
Reported: 2012-03-23 10:18 PDT by Erik Corry
Modified: 2012-03-26 16:30 PDT (History)
6 users (show)

See Also:


Attachments
Fix (79.43 KB, patch)
2012-03-25 22:03 PDT, Gavin Barraclough
no flags Details | Formatted Diff | Diff
Fix (79.43 KB, patch)
2012-03-26 12:14 PDT, Gavin Barraclough
oliver: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Erik Corry 2012-03-23 10:18:17 PDT
Greek sigma is handled wrong in case independent regexp.

/ΣΤΙΓΜΑΣ/iu.test("στιγμας") should be true.

See also https://mail.mozilla.org/pipermail/es-discuss/2012-March/021631.html

tested on Version 5.1.2 (6534.52.7)
Comment 1 Alexey Proskuryakov 2012-03-23 11:59:14 PDT
The "u" flag appears to be proposed in <http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/>. But this bug should not depend on it, as far as I understand it.

ς is lowercase variant of Σ when used in final position in a word. In other positions, it's σ.

I'm not quite sure what the requested behavior is. Should /ΔΣΔ/i.test("δςδ") be true, as well? What about /ς/i.test("σ")?
Comment 2 Zoltan Herczeg 2012-03-23 12:52:17 PDT
We had a long discussion about it in PCRE:
http://bugs.exim.org/show_bug.cgi?id=1208

General caseless matching is just too overcomplicated.
Comment 3 Erik Corry 2012-03-25 12:28:27 PDT
> /ΔΣΔ/i.test("δςδ") be true, as well? What about /ς/i.test("σ")

Both examples return true in V8, and I believe that is the correct behaviour.
Comment 4 Gavin Barraclough 2012-03-25 22:03:44 PDT
Created attachment 133729 [details]
Fix
Comment 5 Gavin Barraclough 2012-03-25 22:06:29 PDT
(In reply to comment #2)
> We had a long discussion about it in PCRE:
> http://bugs.exim.org/show_bug.cgi?id=1208
> 
> General caseless matching is just too overcomplicated.

Bear in mind we don't need to come up with good rules here for how this ought to work (that's a task for ES6!), all we have to do is implement what's already specified in ES5.1 correctly. :-)
Comment 6 Alexey Proskuryakov 2012-03-25 23:26:45 PDT
Comment on attachment 133729 [details]
Fix

View in context: https://bugs.webkit.org/attachment.cgi?id=133729&action=review

> LayoutTests/fast/regex/script-tests/unicodeCaseInsensitive.js:4
> +// However it looks like DRT is loading it's source as Latin1 rather than UTF8? - so escaping.

Adding a UTF-8 BOM at the beginning of this file should solve that.
Comment 7 Zoltan Herczeg 2012-03-26 00:07:24 PDT
Comment on attachment 133729 [details]
Fix

Nice patch!

View in context: https://bugs.webkit.org/attachment.cgi?id=133729&action=review

> Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:41
> +    CanonicalizeRangeLo,              // Value is positive delta to pair, E.g. 0x41 has value 0x20, -> 0x61.
> +    CanonicalizeRangeHi,              // Value is positive delta to pair, E.g. 0x61 has value 0x20, -> 0x41.

Just an idea: could these two be merged into one with positive value for low, and negative value for hi ranges?

> Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:43
> +    CanonicalizeAlternatingAligned,   // Aligned consequtive pair, e.g. 0x1f4,0x1f5.
> +    CanonicalizeAlternatingUnaligned, // Unaligned consequtive pair, e.g. 0x241,0x242.

Another idea for merge: see below

> Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:94
> +        return ch & 1 ? ch + 1 : ch - 1;

This could be simplified as: ((ch - 1) ^ 1) + 1

If we merge the aligned and unaligned to one state: ((ch - info->begin) ^ 1) + info->begin
This should cheap since info->begin is likely in some register.
Or we could set info->value to 0 or 1 depending on the alignment.
Comment 8 Gavin Barraclough 2012-03-26 12:10:39 PDT
> Just an idea: could these two be merged into one with positive value for low, and negative value for hi ranges?

They could, but I didn't for a few reasons.  First, the table is nicely packed into 64-bit entries with a 16bit uint for the value; to really be able to represent this as a signed value, it would need to be 17bit.  We could use a 16bit positive offset & ensure we wrap correctly, but this can be error prone.  The extra state buys you the 17th bit.  Also, sometimes we may want to know which value is greater, and we save branching to determine that this way.

More importantly, I wanted a fast check for cases we can mask rather than comparing both values.  For RangeLo values that is !(ch & value) && !(value & (value - 1)), for RangeHi (ch & value) && !(value & (value - 1)).  We don't currently use this, because all non-ascii case-insensitive non-unique PatternCharracters are converted into CharacetrClasses, but this will enable us to handle many unicode ranges as fast as ascii in the future.

> > Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:94
> > +        return ch & 1 ? ch + 1 : ch - 1;
> 
> This could be simplified as: ((ch - 1) ^ 1) + 1

I'd decided to stick to the simplest most readable form - premature optimization tends to lead to convoluted and less maintainable code - but you're right, seeing this written out, it's pretty clear - I think I'll go with this.

> If we merge the aligned and unaligned to one state: ((ch - info->begin) ^ 1) + info->begin
> This should cheap since info->begin is likely in some register.
> Or we could set info->value to 0 or 1 depending on the alignment.

This kind of bit-twiddling should really be reserved for places it actually makes something go measurably faster. :-)  Bear in mind, these conversions should largely be being performed at compile time, and there higher level optimizations we're missing for unicode case-insensitive compared that are likely to have much higher impact (e.g., don't convert them all into CharacterClasses when building the AST!)

cheers,
G.
Comment 9 Gavin Barraclough 2012-03-26 12:14:49 PDT
Created attachment 133860 [details]
Fix
Comment 10 Gavin Barraclough 2012-03-26 13:13:54 PDT
Fixed in r112143
Comment 11 Emil A Eklund 2012-03-26 16:30:33 PDT
This change caused fast/regex/unicodeCaseInsensitive.html to start failing, please see https://bugs.webkit.org/show_bug.cgi?id=82259