Bug 82063

Summary:

Greek sigma is handled wrong in case independent regexp.

Product:

WebKit

Reporter:

Erik Corry <erikcorry>

Component:

JavaScriptCore

Assignee:

Gavin Barraclough <barraclough>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

ap, barraclough, eae, msaboff, oliver, zherczeg

Priority:

Version:

528+ (Nightly build)

Hardware:

Unspecified

OS:

Unspecified

URL:

https://mail.mozilla.org/pipermail/es-discuss/2012-March/021631.html

Attachments:

Description	Flags
Fix	none
Fix	oliver: review+

Erik Corry

Reported 2012-03-23 10:18:17 PDT

Greek sigma is handled wrong in case independent regexp. /ΣΤΙΓΜΑΣ/iu.test("στιγμας") should be true. See also https://mail.mozilla.org/pipermail/es-discuss/2012-March/021631.html tested on Version 5.1.2 (6534.52.7)

Attachments
Fix (79.43 KB, patch) 2012-03-25 22:03 PDT, Gavin Barraclough	no flags	Details Formatted Diff Diff
Fix (79.43 KB, patch) 2012-03-26 12:14 PDT, Gavin Barraclough	oliver: review+	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Alexey Proskuryakov

Comment 1 2012-03-23 11:59:14 PDT

The "u" flag appears to be proposed in <http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/>. But this bug should not depend on it, as far as I understand it. ς is lowercase variant of Σ when used in final position in a word. In other positions, it's σ. I'm not quite sure what the requested behavior is. Should /ΔΣΔ/i.test("δςδ") be true, as well? What about /ς/i.test("σ")?

Zoltan Herczeg

Comment 2 2012-03-23 12:52:17 PDT

We had a long discussion about it in PCRE: http://bugs.exim.org/show_bug.cgi?id=1208 General caseless matching is just too overcomplicated.

Erik Corry

Comment 3 2012-03-25 12:28:27 PDT

> /ΔΣΔ/i.test("δςδ") be true, as well? What about /ς/i.test("σ") Both examples return true in V8, and I believe that is the correct behaviour.

Gavin Barraclough

Comment 4 2012-03-25 22:03:44 PDT

Created attachment 133729 [details] Fix

Gavin Barraclough

Comment 5 2012-03-25 22:06:29 PDT

(In reply to comment #2) > We had a long discussion about it in PCRE: > http://bugs.exim.org/show_bug.cgi?id=1208 > > General caseless matching is just too overcomplicated. Bear in mind we don't need to come up with good rules here for how this ought to work (that's a task for ES6!), all we have to do is implement what's already specified in ES5.1 correctly. :-)

Alexey Proskuryakov

Comment 6 2012-03-25 23:26:45 PDT

Comment on attachment 133729 [details] Fix View in context: https://bugs.webkit.org/attachment.cgi?id=133729&action=review > LayoutTests/fast/regex/script-tests/unicodeCaseInsensitive.js:4 > +// However it looks like DRT is loading it's source as Latin1 rather than UTF8? - so escaping. Adding a UTF-8 BOM at the beginning of this file should solve that.

Zoltan Herczeg

Comment 7 2012-03-26 00:07:24 PDT

Comment on attachment 133729 [details] Fix Nice patch! View in context: https://bugs.webkit.org/attachment.cgi?id=133729&action=review > Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:41 > + CanonicalizeRangeLo, // Value is positive delta to pair, E.g. 0x41 has value 0x20, -> 0x61. > + CanonicalizeRangeHi, // Value is positive delta to pair, E.g. 0x61 has value 0x20, -> 0x41. Just an idea: could these two be merged into one with positive value for low, and negative value for hi ranges? > Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:43 > + CanonicalizeAlternatingAligned, // Aligned consequtive pair, e.g. 0x1f4,0x1f5. > + CanonicalizeAlternatingUnaligned, // Unaligned consequtive pair, e.g. 0x241,0x242. Another idea for merge: see below > Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:94 > + return ch & 1 ? ch + 1 : ch - 1; This could be simplified as: ((ch - 1) ^ 1) + 1 If we merge the aligned and unaligned to one state: ((ch - info->begin) ^ 1) + info->begin This should cheap since info->begin is likely in some register. Or we could set info->value to 0 or 1 depending on the alignment.

Gavin Barraclough

Comment 8 2012-03-26 12:10:39 PDT

> Just an idea: could these two be merged into one with positive value for low, and negative value for hi ranges? They could, but I didn't for a few reasons. First, the table is nicely packed into 64-bit entries with a 16bit uint for the value; to really be able to represent this as a signed value, it would need to be 17bit. We could use a 16bit positive offset & ensure we wrap correctly, but this can be error prone. The extra state buys you the 17th bit. Also, sometimes we may want to know which value is greater, and we save branching to determine that this way. More importantly, I wanted a fast check for cases we can mask rather than comparing both values. For RangeLo values that is !(ch & value) && !(value & (value - 1)), for RangeHi (ch & value) && !(value & (value - 1)). We don't currently use this, because all non-ascii case-insensitive non-unique PatternCharracters are converted into CharacetrClasses, but this will enable us to handle many unicode ranges as fast as ascii in the future. > > Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h:94 > > + return ch & 1 ? ch + 1 : ch - 1; > > This could be simplified as: ((ch - 1) ^ 1) + 1 I'd decided to stick to the simplest most readable form - premature optimization tends to lead to convoluted and less maintainable code - but you're right, seeing this written out, it's pretty clear - I think I'll go with this. > If we merge the aligned and unaligned to one state: ((ch - info->begin) ^ 1) + info->begin > This should cheap since info->begin is likely in some register. > Or we could set info->value to 0 or 1 depending on the alignment. This kind of bit-twiddling should really be reserved for places it actually makes something go measurably faster. :-) Bear in mind, these conversions should largely be being performed at compile time, and there higher level optimizations we're missing for unicode case-insensitive compared that are likely to have much higher impact (e.g., don't convert them all into CharacterClasses when building the AST!) cheers, G.

Gavin Barraclough

Comment 9 2012-03-26 12:14:49 PDT

Created attachment 133860 [details] Fix

Gavin Barraclough

Comment 10 2012-03-26 13:13:54 PDT

Fixed in r112143

Emil A Eklund

Comment 11 2012-03-26 16:30:33 PDT

This change caused fast/regex/unicodeCaseInsensitive.html to start failing, please see https://bugs.webkit.org/show_bug.cgi?id=82259

Note You need to log in before you can comment on or make changes to this bug.