RESOLVED FIXED 13138
StringImpl::isLower calls |islower| for non-ASCII characters
https://bugs.webkit.org/show_bug.cgi?id=13138
Summary StringImpl::isLower calls |islower| for non-ASCII characters
Jungshik Shin
Reported 2007-03-20 16:50:31 PDT
StringImpl::isLower calls |islower| for non-ASCII characters. This leads to an assertion in debug build. It can be easily avoided, but it'd sacrifice performance for ASCII strings. In StringImpl.cpp, there are a few similar cases.
Attachments
path (1.53 KB, patch)
2007-03-24 13:34 PDT, Jungshik Shin
darin: review-
patch with mistakes fixed (1.58 KB, patch)
2007-03-24 19:54 PDT, Jungshik Shin
darin: review-
patch 3 : addressing Darin's concerns (1.66 KB, patch)
2007-03-24 20:16 PDT, Jungshik Shin
darin: review-
patch with changelog entry (2.34 KB, patch)
2007-03-28 11:07 PDT, Jungshik Shin
darin: review+
Alexey Proskuryakov
Comment 1 2007-03-20 22:24:22 PDT
Reporter, did you mean that this causes an assertion failure in debug builds of WebKit on Mac OS X? What line does it happen on? How can one reproduce the problem?
Jungshik Shin
Comment 2 2007-03-22 16:50:09 PDT
Different C libraries differ in the way they treat values beyond 0xFF. The man page of islower on Mac OS 10.4 has the following to say about it. COMPATIBILITY The 4.4BSD extension of accepting arguments outside of the range of the unsigned char type in locales with large character sets is considered obsolete and may not be supported in future releases. The iswlower() function should be used instead. --------- Because the code in question is XP, I think we should not try to squeeze out every bit of performance by being TOO clever, which may lead to a compatibility issue. There are a couple of other places that use a similar trick in StringImpl.cpp. I'll upload a patch soon.
Jungshik Shin
Comment 3 2007-03-24 13:34:26 PDT
Jungshik Shin
Comment 4 2007-03-24 13:37:01 PDT
Comment on attachment 13799 [details] path this patch 'fixes' both isLower and lower. This is still broken for non-BMP characters, though. We may just consider calling the corresponding ICU functions because ICU folks must have done a good job at both optimization and 'correctness'.
Darin Adler
Comment 5 2007-03-24 19:47:15 PDT
(In reply to comment #4) > (From update of attachment 13799 [details] [edit]) > this patch 'fixes' both isLower and lower. > > This is still broken for non-BMP characters, though. We may just consider > calling the corresponding ICU functions because ICU folks must have done a good > job at both optimization and 'correctness'. There are two different operations. One is "is lowercase". For that we should use the Unicode functions in WTF, which call through to ICU. The other is "is lowercase ASCII". There are lots of places that need this operation. I believe that if you look at the clients of StringImpl::isLower you'll see that "is all lowercase ASCII" is what almost all those clients want. So we might want to rename it. There's no reason to use the more-general much slower "is lowercase" operation. Contrary to what you say here, the ICU "islower" operation is much much slower and shows up on profiles when we use it excessively. Now as to the misuse of <ctype.h>, it's true, there's a lot of code that calls <ctype.h> and expects non-standard behavior of returning false for non-ASCII characters. We should fix all of that code. Not just StringImpl::isLower.
Darin Adler
Comment 6 2007-03-24 19:48:29 PDT
(In reply to comment #2) > Because the code in question is XP, I think we should not try to squeeze out > every bit of performance by being TOO clever, which may lead to a compatibility > issue. I don't agree with this. We need to squeeze out every bit of performance. There's no reason we can't have both good cross-platform compatibility and great performance. That should be our goal here.
Darin Adler
Comment 7 2007-03-24 19:53:28 PDT
Comment on attachment 13799 [details] path This is not acceptable as-is. We need the high speed loop for all ASCII and the loop can't have a branch in it or it won't be high speed. I suggest simply and-ing with 0x7F before calling islower. We will ignore the result of islower if the characters are not all ASCII. allLower &= islower(c & 0x7F); data[i] = tolower(c & 0x7F); We use "FIXME:", not "XXX" for comments like that one about non-BMP. + // XXX Still broken for non-BMP characters represented in surrogate pairs What non-BMP characters represented as surrogate pairs are lowercase? I don't believe there are any. If there are, you should add a regression test to demonstrate the symptom of the incorrect logic.
Jungshik Shin
Comment 8 2007-03-24 19:54:12 PDT
Created attachment 13806 [details] patch with mistakes fixed there were a couple of mistakes (typo-like) in the previous patch. They're fixed in this patch.
Darin Adler
Comment 9 2007-03-24 20:01:44 PDT
Comment on attachment 13806 [details] patch with mistakes fixed review- for the same reasons mentioned in comment #7
Jungshik Shin
Comment 10 2007-03-24 20:16:20 PDT
Created attachment 13807 [details] patch 3 : addressing Darin's concerns Ok. I made changes you suggested. I also applied the same technique to EqualIgnoringCase. As for non-BMP charcters with case distinctions, there are two groups excluding 'language tagging characters' (at U+10xxxx) nobody is likely to use. One is Deseret alphabet and the other is Mathematical symbols. http://www.unicode.org/charts/PDF/U10400.pdf http://www.unicode.org/charts/PDF/U1D400.pdf
Jungshik Shin
Comment 11 2007-03-24 20:17:27 PDT
Comment on attachment 13807 [details] patch 3 : addressing Darin's concerns asking for review. Sorry I didn't read your review comment when uploading my 2nd patch.
Jungshik Shin
Comment 12 2007-03-24 20:21:19 PDT
BTW, is the second argument for equalIgnoringCase(const StringImple*, const char*) only ASCII or can it be Latin-1 or whatever locale-dependent encoding? In that case, we need to check its 'ASCIIness' as well.
Jungshik Shin
Comment 13 2007-03-24 20:52:51 PDT
(In reply to comment #5) Thank you for your comments. I also stand corrected about ICU's islower. > The other is "is lowercase ASCII". There are lots of places that need this > operation. I believe that if you look at the clients of StringImpl::isLower > you'll see that "is all lowercase ASCII" is what almost all those clients want. > So we might want to rename it. There's no reason to use the more-general much > slower "is lowercase" operation. I'm well aware that there are numerous cases in web browsers where we need fast ASCII-only lowercasing and case-insensitive comparison/match. (BTW, it needs to be noted that assuming all-ASCII on call-sites are *incorrect* in some cases.) That being said, I agree that we may as well have a separate API for an ASCII-only string : isLowerASCII > Now as to the misuse of <ctype.h>, it's true, there's a lot of code that calls > <ctype.h> and expects non-standard behavior of returning false for non-ASCII > characters. We should fix all of that code. Not just StringImpl::isLower. Yes. bug 12830 is one of them.
Darin Adler
Comment 14 2007-03-27 21:53:01 PDT
Comment on attachment 13807 [details] patch 3 : addressing Darin's concerns Change looks great, r=me. Setting review- because it doesn't include a change log entry.
Darin Adler
Comment 15 2007-03-27 21:54:21 PDT
(In reply to comment #12) > BTW, is the second argument for equalIgnoringCase(const StringImple*, const > char*) only ASCII or can it be Latin-1 or whatever locale-dependent encoding? > In that case, we need to check its 'ASCIIness' as well. It's only for ASCII strings; maybe it would be good to have debug-only code that asserts it's all ASCII so at least we'd find out at runtime if someone misuses it.
Jungshik Shin
Comment 16 2007-03-28 11:07:40 PDT
Created attachment 13846 [details] patch with changelog entry Thanks, Darin. I added a changelog entry. Can you check this in if there isn't any more porblem?
Darin Adler
Comment 17 2007-03-28 11:26:49 PDT
Comment on attachment 13846 [details] patch with changelog entry r=me
Darin Adler
Comment 18 2007-03-28 15:09:13 PDT
Committed revision 20559.
Note You need to log in before you can comment on or make changes to this bug.