Bug 13138

Summary: StringImpl::isLower calls |islower| for non-ASCII characters
Product: WebKit Reporter: Jungshik Shin <jshin>
Component: WebCore Misc.Assignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, darin
Priority: P2    
Version: 523.x (Safari 3)   
Hardware: Mac   
OS: OS X 10.4   
Attachments:
Description Flags
path
darin: review-
patch with mistakes fixed
darin: review-
patch 3 : addressing Darin's concerns
darin: review-
patch with changelog entry darin: review+

Description Jungshik Shin 2007-03-20 16:50:31 PDT
StringImpl::isLower calls |islower| for non-ASCII characters. This leads to an assertion in debug build. 

It can be easily avoided, but it'd sacrifice performance for ASCII strings. 

In StringImpl.cpp, there are a few similar cases.
Comment 1 Alexey Proskuryakov 2007-03-20 22:24:22 PDT
Reporter, did you mean that this causes an assertion failure in debug builds of WebKit on Mac OS X? What line does it happen on? How can one reproduce the problem?
Comment 2 Jungshik Shin 2007-03-22 16:50:09 PDT
Different C libraries differ in the way they treat values beyond 0xFF. The man page of islower on Mac OS 10.4 has the following to say about it. 



COMPATIBILITY
     The 4.4BSD extension of accepting arguments outside of the range of the
     unsigned char type in locales with large character sets is considered
     obsolete and may not be supported in future releases.  The iswlower()
     function should be used instead.

---------

Because the code in question is XP, I think we should not try to squeeze out every bit of performance by being TOO clever, which may lead to a compatibility issue.  

There are a couple of other places that use a similar trick in StringImpl.cpp. I'll upload a patch soon. 
Comment 3 Jungshik Shin 2007-03-24 13:34:26 PDT
Created attachment 13799 [details]
path
Comment 4 Jungshik Shin 2007-03-24 13:37:01 PDT
Comment on attachment 13799 [details]
path

this patch 'fixes' both isLower and lower. 

This is still broken for non-BMP characters, though. We may just consider calling the corresponding ICU functions because ICU folks must have done a good job at both optimization and 'correctness'.
Comment 5 Darin Adler 2007-03-24 19:47:15 PDT
(In reply to comment #4)
> (From update of attachment 13799 [details] [edit])
> this patch 'fixes' both isLower and lower. 
> 
> This is still broken for non-BMP characters, though. We may just consider
> calling the corresponding ICU functions because ICU folks must have done a good
> job at both optimization and 'correctness'. 

There are two different operations. One is "is lowercase". For that we should use the Unicode functions in WTF, which call through to ICU.

The other is "is lowercase ASCII". There are lots of places that need this operation. I believe that if you look at the clients of StringImpl::isLower you'll see that "is all lowercase ASCII" is what almost all those clients want. So we might want to rename it. There's no reason to use the more-general much slower "is lowercase" operation.

Contrary to what you say here, the ICU "islower" operation is much much slower and shows up on profiles when we use it excessively.

Now as to the misuse of <ctype.h>, it's true, there's a lot of code that calls <ctype.h> and expects non-standard behavior of returning false for non-ASCII characters. We should fix all of that code. Not just StringImpl::isLower.
Comment 6 Darin Adler 2007-03-24 19:48:29 PDT
(In reply to comment #2)
> Because the code in question is XP, I think we should not try to squeeze out
> every bit of performance by being TOO clever, which may lead to a compatibility
> issue.  

I don't agree with this. We need to squeeze out every bit of performance. There's no reason we can't have both good cross-platform compatibility and great performance. That should be our goal here.
Comment 7 Darin Adler 2007-03-24 19:53:28 PDT
Comment on attachment 13799 [details]
path

This is not acceptable as-is. We need the high speed loop for all ASCII and the loop can't have a branch in it or it won't be high speed. I suggest simply and-ing with 0x7F before calling islower. We will ignore the result of islower if the characters are not all ASCII.

    allLower &= islower(c & 0x7F);
    data[i] = tolower(c & 0x7F);


We use "FIXME:", not "XXX" for comments like that one about non-BMP.

+    // XXX Still broken for non-BMP characters represented in surrogate pairs

What non-BMP characters represented as surrogate pairs are lowercase? I don't believe there are any. If there are, you should add a regression test to demonstrate the symptom of the incorrect logic.
Comment 8 Jungshik Shin 2007-03-24 19:54:12 PDT
Created attachment 13806 [details]
patch with mistakes fixed

there were a couple of mistakes (typo-like) in the previous patch. They're fixed in this patch.
Comment 9 Darin Adler 2007-03-24 20:01:44 PDT
Comment on attachment 13806 [details]
patch with mistakes fixed

review- for the same reasons mentioned in comment #7
Comment 10 Jungshik Shin 2007-03-24 20:16:20 PDT
Created attachment 13807 [details]
patch 3 : addressing Darin's concerns

Ok. I made changes you suggested. I also applied the same technique to EqualIgnoringCase. 

As for non-BMP charcters with case distinctions, there are two groups excluding 'language tagging characters' (at U+10xxxx) nobody is likely to use. One is Deseret alphabet and the other is Mathematical symbols. 


http://www.unicode.org/charts/PDF/U10400.pdf
http://www.unicode.org/charts/PDF/U1D400.pdf
Comment 11 Jungshik Shin 2007-03-24 20:17:27 PDT
Comment on attachment 13807 [details]
patch 3 : addressing Darin's concerns

asking for review.
Sorry I didn't read your review comment when uploading my 2nd patch.
Comment 12 Jungshik Shin 2007-03-24 20:21:19 PDT
BTW, is the second argument for equalIgnoringCase(const StringImple*, const char*) only ASCII or can it be Latin-1 or whatever locale-dependent encoding? In that case, we need to check its 'ASCIIness' as well. 
Comment 13 Jungshik Shin 2007-03-24 20:52:51 PDT
(In reply to comment #5)

Thank you for your comments. I also stand corrected about ICU's islower. 

> The other is "is lowercase ASCII". There are lots of places that need this
> operation. I believe that if you look at the clients of StringImpl::isLower
> you'll see that "is all lowercase ASCII" is what almost all those clients want.
> So we might want to rename it. There's no reason to use the more-general much
> slower "is lowercase" operation.

I'm well aware that there are numerous cases in web browsers where we need fast ASCII-only lowercasing and case-insensitive comparison/match. (BTW, it needs to be noted that assuming all-ASCII on call-sites are *incorrect* in some cases.)  That being said, I agree that we may as well have a separate API for an ASCII-only string : isLowerASCII   

 
> Now as to the misuse of <ctype.h>, it's true, there's a lot of code that calls
> <ctype.h> and expects non-standard behavior of returning false for non-ASCII
> characters. We should fix all of that code. Not just StringImpl::isLower.

Yes. bug 12830 is one of them. 

Comment 14 Darin Adler 2007-03-27 21:53:01 PDT
Comment on attachment 13807 [details]
patch 3 : addressing Darin's concerns

Change looks great, r=me. Setting review- because it doesn't include a change log entry.
Comment 15 Darin Adler 2007-03-27 21:54:21 PDT
(In reply to comment #12)
> BTW, is the second argument for equalIgnoringCase(const StringImple*, const
> char*) only ASCII or can it be Latin-1 or whatever locale-dependent encoding?
> In that case, we need to check its 'ASCIIness' as well. 

It's only for ASCII strings; maybe it would be good to have debug-only code that asserts it's all ASCII so at least we'd find out at runtime if someone misuses it.
Comment 16 Jungshik Shin 2007-03-28 11:07:40 PDT
Created attachment 13846 [details]
patch with changelog entry

Thanks, Darin. 
I added a changelog entry. Can you check this in if there isn't any more porblem?
Comment 17 Darin Adler 2007-03-28 11:26:49 PDT
Comment on attachment 13846 [details]
patch with changelog entry

r=me
Comment 18 Darin Adler 2007-03-28 15:09:13 PDT
Committed revision 20559.