Bug 13138

Summary:

StringImpl::isLower calls |islower| for non-ASCII characters

Product:

WebKit

Reporter:

Jungshik Shin <jshin>

Component:

WebCore Misc.

Assignee:

Nobody <webkit-unassigned>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

ap, darin

Priority:

Version:

523.x (Safari 3)

Hardware:

Mac

OS:

OS X 10.4

Attachments:

Description	Flags
path	darin: review-
patch with mistakes fixed	darin: review-
patch 3 : addressing Darin's concerns	darin: review-
patch with changelog entry	darin: review+

Jungshik Shin

Reported 2007-03-20 16:50:31 PDT

StringImpl::isLower calls |islower| for non-ASCII characters. This leads to an assertion in debug build. It can be easily avoided, but it'd sacrifice performance for ASCII strings. In StringImpl.cpp, there are a few similar cases.

Attachments
path (1.53 KB, patch) 2007-03-24 13:34 PDT, Jungshik Shin	darin: review-	Details Formatted Diff Diff
patch with mistakes fixed (1.58 KB, patch) 2007-03-24 19:54 PDT, Jungshik Shin	darin: review-	Details Formatted Diff Diff
patch 3 : addressing Darin's concerns (1.66 KB, patch) 2007-03-24 20:16 PDT, Jungshik Shin	darin: review-	Details Formatted Diff Diff
patch with changelog entry (2.34 KB, patch) 2007-03-28 11:07 PDT, Jungshik Shin	darin: review+	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.

Alexey Proskuryakov

Comment 1 2007-03-20 22:24:22 PDT

Reporter, did you mean that this causes an assertion failure in debug builds of WebKit on Mac OS X? What line does it happen on? How can one reproduce the problem?

Jungshik Shin

Comment 2 2007-03-22 16:50:09 PDT

Different C libraries differ in the way they treat values beyond 0xFF. The man page of islower on Mac OS 10.4 has the following to say about it. COMPATIBILITY The 4.4BSD extension of accepting arguments outside of the range of the unsigned char type in locales with large character sets is considered obsolete and may not be supported in future releases. The iswlower() function should be used instead. --------- Because the code in question is XP, I think we should not try to squeeze out every bit of performance by being TOO clever, which may lead to a compatibility issue. There are a couple of other places that use a similar trick in StringImpl.cpp. I'll upload a patch soon.

Jungshik Shin

Comment 3 2007-03-24 13:34:26 PDT

Created attachment 13799 [details] path

Jungshik Shin

Comment 4 2007-03-24 13:37:01 PDT

Comment on attachment 13799 [details] path this patch 'fixes' both isLower and lower. This is still broken for non-BMP characters, though. We may just consider calling the corresponding ICU functions because ICU folks must have done a good job at both optimization and 'correctness'.

Darin Adler

Comment 5 2007-03-24 19:47:15 PDT

(In reply to comment #4) > (From update of attachment 13799 [details] [edit]) > this patch 'fixes' both isLower and lower. > > This is still broken for non-BMP characters, though. We may just consider > calling the corresponding ICU functions because ICU folks must have done a good > job at both optimization and 'correctness'. There are two different operations. One is "is lowercase". For that we should use the Unicode functions in WTF, which call through to ICU. The other is "is lowercase ASCII". There are lots of places that need this operation. I believe that if you look at the clients of StringImpl::isLower you'll see that "is all lowercase ASCII" is what almost all those clients want. So we might want to rename it. There's no reason to use the more-general much slower "is lowercase" operation. Contrary to what you say here, the ICU "islower" operation is much much slower and shows up on profiles when we use it excessively. Now as to the misuse of <ctype.h>, it's true, there's a lot of code that calls <ctype.h> and expects non-standard behavior of returning false for non-ASCII characters. We should fix all of that code. Not just StringImpl::isLower.

Darin Adler

Comment 6 2007-03-24 19:48:29 PDT

(In reply to comment #2) > Because the code in question is XP, I think we should not try to squeeze out > every bit of performance by being TOO clever, which may lead to a compatibility > issue. I don't agree with this. We need to squeeze out every bit of performance. There's no reason we can't have both good cross-platform compatibility and great performance. That should be our goal here.

Darin Adler

Comment 7 2007-03-24 19:53:28 PDT

Comment on attachment 13799 [details] path This is not acceptable as-is. We need the high speed loop for all ASCII and the loop can't have a branch in it or it won't be high speed. I suggest simply and-ing with 0x7F before calling islower. We will ignore the result of islower if the characters are not all ASCII. allLower &= islower(c & 0x7F); data[i] = tolower(c & 0x7F); We use "FIXME:", not "XXX" for comments like that one about non-BMP. + // XXX Still broken for non-BMP characters represented in surrogate pairs What non-BMP characters represented as surrogate pairs are lowercase? I don't believe there are any. If there are, you should add a regression test to demonstrate the symptom of the incorrect logic.

Jungshik Shin

Comment 8 2007-03-24 19:54:12 PDT

Created attachment 13806 [details] patch with mistakes fixed there were a couple of mistakes (typo-like) in the previous patch. They're fixed in this patch.

Darin Adler

Comment 9 2007-03-24 20:01:44 PDT

Comment on attachment 13806 [details] patch with mistakes fixed review- for the same reasons mentioned in comment #7

Jungshik Shin

Comment 10 2007-03-24 20:16:20 PDT

Created attachment 13807 [details] patch 3 : addressing Darin's concerns Ok. I made changes you suggested. I also applied the same technique to EqualIgnoringCase. As for non-BMP charcters with case distinctions, there are two groups excluding 'language tagging characters' (at U+10xxxx) nobody is likely to use. One is Deseret alphabet and the other is Mathematical symbols. http://www.unicode.org/charts/PDF/U10400.pdf http://www.unicode.org/charts/PDF/U1D400.pdf

Jungshik Shin

Comment 11 2007-03-24 20:17:27 PDT

Comment on attachment 13807 [details] patch 3 : addressing Darin's concerns asking for review. Sorry I didn't read your review comment when uploading my 2nd patch.

Jungshik Shin

Comment 12 2007-03-24 20:21:19 PDT

BTW, is the second argument for equalIgnoringCase(const StringImple*, const char*) only ASCII or can it be Latin-1 or whatever locale-dependent encoding? In that case, we need to check its 'ASCIIness' as well.

Jungshik Shin

Comment 13 2007-03-24 20:52:51 PDT

(In reply to comment #5) Thank you for your comments. I also stand corrected about ICU's islower. > The other is "is lowercase ASCII". There are lots of places that need this > operation. I believe that if you look at the clients of StringImpl::isLower > you'll see that "is all lowercase ASCII" is what almost all those clients want. > So we might want to rename it. There's no reason to use the more-general much > slower "is lowercase" operation. I'm well aware that there are numerous cases in web browsers where we need fast ASCII-only lowercasing and case-insensitive comparison/match. (BTW, it needs to be noted that assuming all-ASCII on call-sites are *incorrect* in some cases.) That being said, I agree that we may as well have a separate API for an ASCII-only string : isLowerASCII > Now as to the misuse of <ctype.h>, it's true, there's a lot of code that calls > <ctype.h> and expects non-standard behavior of returning false for non-ASCII > characters. We should fix all of that code. Not just StringImpl::isLower. Yes. bug 12830 is one of them.

Darin Adler

Comment 14 2007-03-27 21:53:01 PDT

Comment on attachment 13807 [details] patch 3 : addressing Darin's concerns Change looks great, r=me. Setting review- because it doesn't include a change log entry.

Darin Adler

Comment 15 2007-03-27 21:54:21 PDT

(In reply to comment #12) > BTW, is the second argument for equalIgnoringCase(const StringImple*, const > char*) only ASCII or can it be Latin-1 or whatever locale-dependent encoding? > In that case, we need to check its 'ASCIIness' as well. It's only for ASCII strings; maybe it would be good to have debug-only code that asserts it's all ASCII so at least we'd find out at runtime if someone misuses it.

Jungshik Shin

Comment 16 2007-03-28 11:07:40 PDT

Created attachment 13846 [details] patch with changelog entry Thanks, Darin. I added a changelog entry. Can you check this in if there isn't any more porblem?

Darin Adler

Comment 17 2007-03-28 11:26:49 PDT

Comment on attachment 13846 [details] patch with changelog entry r=me

Darin Adler

Comment 18 2007-03-28 15:09:13 PDT

Committed revision 20559.

Note You need to log in before you can comment on or make changes to this bug.