StringImpl::isLower calls |islower| for non-ASCII characters. This leads to an assertion in debug build. It can be easily avoided, but it'd sacrifice performance for ASCII strings. In StringImpl.cpp, there are a few similar cases.
Reporter, did you mean that this causes an assertion failure in debug builds of WebKit on Mac OS X? What line does it happen on? How can one reproduce the problem?
Different C libraries differ in the way they treat values beyond 0xFF. The man page of islower on Mac OS 10.4 has the following to say about it. COMPATIBILITY The 4.4BSD extension of accepting arguments outside of the range of the unsigned char type in locales with large character sets is considered obsolete and may not be supported in future releases. The iswlower() function should be used instead. --------- Because the code in question is XP, I think we should not try to squeeze out every bit of performance by being TOO clever, which may lead to a compatibility issue. There are a couple of other places that use a similar trick in StringImpl.cpp. I'll upload a patch soon.
Created attachment 13799 [details] path
Comment on attachment 13799 [details] path this patch 'fixes' both isLower and lower. This is still broken for non-BMP characters, though. We may just consider calling the corresponding ICU functions because ICU folks must have done a good job at both optimization and 'correctness'.
(In reply to comment #4) > (From update of attachment 13799 [details] [edit]) > this patch 'fixes' both isLower and lower. > > This is still broken for non-BMP characters, though. We may just consider > calling the corresponding ICU functions because ICU folks must have done a good > job at both optimization and 'correctness'. There are two different operations. One is "is lowercase". For that we should use the Unicode functions in WTF, which call through to ICU. The other is "is lowercase ASCII". There are lots of places that need this operation. I believe that if you look at the clients of StringImpl::isLower you'll see that "is all lowercase ASCII" is what almost all those clients want. So we might want to rename it. There's no reason to use the more-general much slower "is lowercase" operation. Contrary to what you say here, the ICU "islower" operation is much much slower and shows up on profiles when we use it excessively. Now as to the misuse of <ctype.h>, it's true, there's a lot of code that calls <ctype.h> and expects non-standard behavior of returning false for non-ASCII characters. We should fix all of that code. Not just StringImpl::isLower.
(In reply to comment #2) > Because the code in question is XP, I think we should not try to squeeze out > every bit of performance by being TOO clever, which may lead to a compatibility > issue. I don't agree with this. We need to squeeze out every bit of performance. There's no reason we can't have both good cross-platform compatibility and great performance. That should be our goal here.
Comment on attachment 13799 [details] path This is not acceptable as-is. We need the high speed loop for all ASCII and the loop can't have a branch in it or it won't be high speed. I suggest simply and-ing with 0x7F before calling islower. We will ignore the result of islower if the characters are not all ASCII. allLower &= islower(c & 0x7F); data[i] = tolower(c & 0x7F); We use "FIXME:", not "XXX" for comments like that one about non-BMP. + // XXX Still broken for non-BMP characters represented in surrogate pairs What non-BMP characters represented as surrogate pairs are lowercase? I don't believe there are any. If there are, you should add a regression test to demonstrate the symptom of the incorrect logic.
Created attachment 13806 [details] patch with mistakes fixed there were a couple of mistakes (typo-like) in the previous patch. They're fixed in this patch.
Comment on attachment 13806 [details] patch with mistakes fixed review- for the same reasons mentioned in comment #7
Created attachment 13807 [details] patch 3 : addressing Darin's concerns Ok. I made changes you suggested. I also applied the same technique to EqualIgnoringCase. As for non-BMP charcters with case distinctions, there are two groups excluding 'language tagging characters' (at U+10xxxx) nobody is likely to use. One is Deseret alphabet and the other is Mathematical symbols. http://www.unicode.org/charts/PDF/U10400.pdf http://www.unicode.org/charts/PDF/U1D400.pdf
Comment on attachment 13807 [details] patch 3 : addressing Darin's concerns asking for review. Sorry I didn't read your review comment when uploading my 2nd patch.
BTW, is the second argument for equalIgnoringCase(const StringImple*, const char*) only ASCII or can it be Latin-1 or whatever locale-dependent encoding? In that case, we need to check its 'ASCIIness' as well.
(In reply to comment #5) Thank you for your comments. I also stand corrected about ICU's islower. > The other is "is lowercase ASCII". There are lots of places that need this > operation. I believe that if you look at the clients of StringImpl::isLower > you'll see that "is all lowercase ASCII" is what almost all those clients want. > So we might want to rename it. There's no reason to use the more-general much > slower "is lowercase" operation. I'm well aware that there are numerous cases in web browsers where we need fast ASCII-only lowercasing and case-insensitive comparison/match. (BTW, it needs to be noted that assuming all-ASCII on call-sites are *incorrect* in some cases.) That being said, I agree that we may as well have a separate API for an ASCII-only string : isLowerASCII > Now as to the misuse of <ctype.h>, it's true, there's a lot of code that calls > <ctype.h> and expects non-standard behavior of returning false for non-ASCII > characters. We should fix all of that code. Not just StringImpl::isLower. Yes. bug 12830 is one of them.
Comment on attachment 13807 [details] patch 3 : addressing Darin's concerns Change looks great, r=me. Setting review- because it doesn't include a change log entry.
(In reply to comment #12) > BTW, is the second argument for equalIgnoringCase(const StringImple*, const > char*) only ASCII or can it be Latin-1 or whatever locale-dependent encoding? > In that case, we need to check its 'ASCIIness' as well. It's only for ASCII strings; maybe it would be good to have debug-only code that asserts it's all ASCII so at least we'd find out at runtime if someone misuses it.
Created attachment 13846 [details] patch with changelog entry Thanks, Darin. I added a changelog entry. Can you check this in if there isn't any more porblem?
Comment on attachment 13846 [details] patch with changelog entry r=me
Committed revision 20559.