15519 – eliminate use of <ctype.h> for processing ASCII

Darin Adler

Reported 2007-10-15 11:18:44 PDT

Since <ctype.h> can give locale-specific results, it's not good to use it in WebKit. Also, it's not necessarily fast or efficient enough. We'll use our own instead.

Darin Adler

Comment 1 2007-10-15 17:04:43 PDT

Committed revision 26638.

Darin Adler

Comment 2 2007-10-15 17:59:50 PDT

(In reply to comment #1) > Committed revision 26638. Oops. That was for another bug.

Darin Adler

Comment 3 2007-10-15 18:13:10 PDT

Created attachment 16683 [details] patch

Eric Seidel (no email)

Comment 4 2007-10-15 21:45:18 PDT

Comment on attachment 16683 [details] patch I'm a big fan of this change. Those method names are so much cleaner than the ctype ones, and it's nice to get rid of the ctype include hack. I was a little surprised you didn't use template functions to implement the various copy/paste char variants. I would mark this r+ but I didn't stare at the individual is* functions long enough to determine if their actually 100% correct. I expect they are, but I don't trust my brain at this hour.

Geoffrey Garen

Comment 5 2007-10-16 10:59:10 PDT

- if (isalpha(*p) && toupper(*p) != *p) + if (isASCIIAlpha(*p) && toASCIILower(*p) != *p) lstyle = 0; As an EBay reviewer once wrote, "BAD BAD FRAUD."

Darin Adler

Comment 6 2007-10-16 11:15:22 PDT

(In reply to comment #5) > - if (isalpha(*p) && toupper(*p) != *p) > + if (isASCIIAlpha(*p) && toASCIILower(*p) != *p) > lstyle = 0; > > As an EBay reviewer once wrote, "BAD BAD FRAUD." Oops! I wonder why this one didn't show up as a test failure. I'll look into that.

Maciej Stachowiak

Comment 7 2007-10-16 11:28:56 PDT

Comment on attachment 16683 [details] patch Would it be reasonable to make these template functions instead of overloads: +inline bool isASCIIAlpha(char c) { return (c | 0x20) >= 'a' && (c | 0x20) <= 'z'; } +inline bool isASCIIAlpha(unsigned short c) { return (c | 0x20) >= 'a' && (c | 0x20) <= 'z'; } +inline bool isASCIIAlpha(wchar_t c) { return (c | 0x20) >= 'a' && (c | 0x20) <= 'z'; } Perhaps DisallowCType.h or ASCIICType.h should have a comment explaining why ASCIICType functions are preferred to regular ctype functions. r=me anyway but please consider these suggestions.

Geoffrey Garen

Comment 8 2007-10-16 11:40:40 PDT

I verified with a table that the ASCII functions are correct. I agree with Maciej that a better explanation / compiler error would help a developer who run into it down the line. 5948 } else if (c == ' ' || ispunct(c)) { 5949 } else if (c == ' ' || c >= 0x21 && c <= 0x2F || c >= 0x3A && c <= 0x40 || c >= 0x5B && c <= 0x60 || c >= 0x7B && c <= 0x7D) { 59495950 Why not provide isASCIIPunct instead? For ASCIICType.h, I think the preferred style is to indent inside a namespace in header files. Also, why so much duplicated code? For instance, why doesn't isASCIIAlphanumeric call isASCIIAlpha? r=me

Geoffrey Garen

Comment 9 2007-10-16 11:40:52 PDT

I verified with a table that the ASCII functions are correct. I agree with Maciej that a better explanation / compiler error would help a developer who run into it down the line. 5948 } else if (c == ' ' || ispunct(c)) { 5949 } else if (c == ' ' || c >= 0x21 && c <= 0x2F || c >= 0x3A && c <= 0x40 || c >= 0x5B && c <= 0x60 || c >= 0x7B && c <= 0x7D) { 59495950 Why not provide isASCIIPunct instead? For ASCIICType.h, I think the preferred style is to indent inside a namespace in header files. Also, why so much duplicated code? For instance, why doesn't isASCIIAlphanumeric call isASCIIAlpha? r=me

Darin Adler

Comment 10 2007-10-16 12:06:57 PDT

(In reply to comment #9) > - } else if (c == ' ' || ispunct(c)) { > + } else if (c == ' ' || c >= 0x21 && c <= 0x2F || c >= 0x3A && c > <= 0x40 || c >= 0x5B && c <= 0x60 || c >= 0x7B && c <= 0x7D) { > > Why not provide isASCIIPunct instead? To me, this category seems far less definitive than the others. And the rule is specifically about Mac OS X behavior. I suspect that ispunct isn't even the correct rule; at some point we could visit this code and get the rule exactly right for all sorts of keyboards. I don't think it's correct to assume that the only punctuation you could type on a keyboard is ASCII punctuation. I didn't want to dignify the existing rule with a name, but I also didn't want to change behavior. > For ASCIICType.h, I think the preferred style is to indent inside a namespace > in header files. Hyatt and Maciej have been debating this point. I could go either way. > Also, why so much duplicated code? For instance, why doesn't > isASCIIAlphanumeric call isASCIIAlpha? Hard to say exactly why. I like the fact that each of these lines stands alone. I expect that any future changes to this file would primarily be for performance optimization, perhaps to remove some of these functions if we find they aren't really useful, or maybe to add a few more.

Darin Adler

Comment 11 2007-10-16 12:07:59 PDT

(In reply to comment #7) > (From update of attachment 16683 [details] [edit]) > Would it be reasonable to make these template functions instead of overloads: > > +inline bool isASCIIAlpha(char c) { return (c | 0x20) >= 'a' && (c | 0x20) <= > 'z'; } > +inline bool isASCIIAlpha(unsigned short c) { return (c | 0x20) >= 'a' && (c | > 0x20) <= 'z'; } > +inline bool isASCIIAlpha(wchar_t c) { return (c | 0x20) >= 'a' && (c | 0x20) > <= 'z'; } Maybe. Template functions are more annoying when debugging. I'm feeling slightly conflicted about it, and I think I'll leave it as-is for now.

Darin Adler

Comment 12 2007-10-16 13:26:56 PDT

Committed revision 26676.

Attachments
patch (80.44 KB, patch) 2007-10-15 18:13 PDT, Darin Adler	mjs: review+	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.