WebKit Bugzilla
New
Browse
Search+
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
82063
Greek sigma is handled wrong in case independent regexp.
https://bugs.webkit.org/show_bug.cgi?id=82063
Summary
Greek sigma is handled wrong in case independent regexp.
Erik Corry
Reported
2012-03-23 10:18:17 PDT
Greek sigma is handled wrong in case independent regexp. /ΣΤΙΓΜΑΣ/iu.test("στιγμας") should be true. See also
https://mail.mozilla.org/pipermail/es-discuss/2012-March/021631.html
tested on Version 5.1.2 (6534.52.7)
Attachments
Fix
(79.43 KB, patch)
2012-03-25 22:03 PDT
,
Gavin Barraclough
no flags
Details
Formatted Diff
Diff
Fix
(79.43 KB, patch)
2012-03-26 12:14 PDT
,
Gavin Barraclough
oliver
: review+
Details
Formatted Diff
Diff
Show Obsolete
(1)
View All
Add attachment
proposed patch, testcase, etc.
Alexey Proskuryakov
Comment 1
2012-03-23 11:59:14 PDT
The "u" flag appears to be proposed in <
http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/
>. But this bug should not depend on it, as far as I understand it. ς is lowercase variant of Σ when used in final position in a word. In other positions, it's σ. I'm not quite sure what the requested behavior is. Should /ΔΣΔ/i.test("δςδ") be true, as well? What about /ς/i.test("σ")?
Zoltan Herczeg
Comment 2
2012-03-23 12:52:17 PDT
We had a long discussion about it in PCRE:
http://bugs.exim.org/show_bug.cgi?id=1208
General caseless matching is just too overcomplicated.
Erik Corry
Comment 3
2012-03-25 12:28:27 PDT
> /ΔΣΔ/i.test("δςδ") be true, as well? What about /ς/i.test("σ")
Both examples return true in V8, and I believe that is the correct behaviour.
Gavin Barraclough
Comment 4
2012-03-25 22:03:44 PDT
Created
attachment 133729
[details]
Fix
Gavin Barraclough
Comment 5
2012-03-25 22:06:29 PDT
(In reply to
comment #2
)
> We had a long discussion about it in PCRE: >
http://bugs.exim.org/show_bug.cgi?id=1208
> > General caseless matching is just too overcomplicated.
Bear in mind we don't need to come up with good rules here for how this ought to work (that's a task for ES6!), all we have to do is implement what's already specified in ES5.1 correctly. :-)
Alexey Proskuryakov
Comment 6
2012-03-25 23:26:45 PDT
Comment on
attachment 133729
[details]
Fix View in context:
https://bugs.webkit.org/attachment.cgi?id=133729&action=review
> LayoutTests/fast/regex/script-tests/unicodeCaseInsensitive.js:4 > +// However it looks like DRT is loading it's source as Latin1 rather than UTF8? - so escaping.
Adding a UTF-8 BOM at the beginning of this file should solve that.
Zoltan Herczeg
Comment 7
2012-03-26 00:07:24 PDT
Comment on
attachment 133729
[details]
Fix Nice patch! View in context:
https://bugs.webkit.org/attachment.cgi?id=133729&action=review
> Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:41 > + CanonicalizeRangeLo, // Value is positive delta to pair, E.g. 0x41 has value 0x20, -> 0x61. > + CanonicalizeRangeHi, // Value is positive delta to pair, E.g. 0x61 has value 0x20, -> 0x41.
Just an idea: could these two be merged into one with positive value for low, and negative value for hi ranges?
> Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:43 > + CanonicalizeAlternatingAligned, // Aligned consequtive pair, e.g. 0x1f4,0x1f5. > + CanonicalizeAlternatingUnaligned, // Unaligned consequtive pair, e.g. 0x241,0x242.
Another idea for merge: see below
> Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:94 > + return ch & 1 ? ch + 1 : ch - 1;
This could be simplified as: ((ch - 1) ^ 1) + 1 If we merge the aligned and unaligned to one state: ((ch - info->begin) ^ 1) + info->begin This should cheap since info->begin is likely in some register. Or we could set info->value to 0 or 1 depending on the alignment.
Gavin Barraclough
Comment 8
2012-03-26 12:10:39 PDT
> Just an idea: could these two be merged into one with positive value for low, and negative value for hi ranges?
They could, but I didn't for a few reasons. First, the table is nicely packed into 64-bit entries with a 16bit uint for the value; to really be able to represent this as a signed value, it would need to be 17bit. We could use a 16bit positive offset & ensure we wrap correctly, but this can be error prone. The extra state buys you the 17th bit. Also, sometimes we may want to know which value is greater, and we save branching to determine that this way. More importantly, I wanted a fast check for cases we can mask rather than comparing both values. For RangeLo values that is !(ch & value) && !(value & (value - 1)), for RangeHi (ch & value) && !(value & (value - 1)). We don't currently use this, because all non-ascii case-insensitive non-unique PatternCharracters are converted into CharacetrClasses, but this will enable us to handle many unicode ranges as fast as ascii in the future.
> > Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:94 > > + return ch & 1 ? ch + 1 : ch - 1; > > This could be simplified as: ((ch - 1) ^ 1) + 1
I'd decided to stick to the simplest most readable form - premature optimization tends to lead to convoluted and less maintainable code - but you're right, seeing this written out, it's pretty clear - I think I'll go with this.
> If we merge the aligned and unaligned to one state: ((ch - info->begin) ^ 1) + info->begin > This should cheap since info->begin is likely in some register. > Or we could set info->value to 0 or 1 depending on the alignment.
This kind of bit-twiddling should really be reserved for places it actually makes something go measurably faster. :-) Bear in mind, these conversions should largely be being performed at compile time, and there higher level optimizations we're missing for unicode case-insensitive compared that are likely to have much higher impact (e.g., don't convert them all into CharacterClasses when building the AST!) cheers, G.
Gavin Barraclough
Comment 9
2012-03-26 12:14:49 PDT
Created
attachment 133860
[details]
Fix
Gavin Barraclough
Comment 10
2012-03-26 13:13:54 PDT
Fixed in
r112143
Emil A Eklund
Comment 11
2012-03-26 16:30:33 PDT
This change caused fast/regex/unicodeCaseInsensitive.html to start failing, please see
https://bugs.webkit.org/show_bug.cgi?id=82259
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug