Bug 7852 - New unicode abstraction layer
Summary: New unicode abstraction layer
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 420+
Hardware: Other Linux
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks: 7383
  Show dependency treegraph
 
Reported: 2006-03-18 16:19 PST by George Staikos
Modified: 2006-04-03 16:46 PDT (History)
3 users (show)

See Also:


Attachments
My proposed solution (5.39 KB, application/octet-stream)
2006-03-18 16:21 PST, George Staikos
mjs: review-
Details
Updated patch (17.03 KB, patch)
2006-04-02 22:30 PDT, George Staikos
mjs: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description George Staikos 2006-03-18 16:19:13 PST
A unicode abstraction layer is required to break the libicu dependency.  This is a heavy penalty for some platforms.
Comment 1 George Staikos 2006-03-18 16:21:31 PST
Created attachment 7161 [details]
My proposed solution
Comment 2 Darin Adler 2006-03-18 19:34:42 PST
Comment on attachment 7161 [details]
My proposed solution

Looks like a fine approach to me.
Comment 3 Alexey Proskuryakov 2006-03-18 22:49:13 PST
Something that doesn't look right to me is that this artificially limits Unicode support to UCS-2:

inline unsigned short toLower(unsigned short c) {
  return static_cast<unsigned short>(u_tolower(uc));
}

Admittedly, exisiting code also uses unsigned shorts, see bug 4920.
Comment 4 George Staikos 2006-03-19 05:21:29 PST
Any suggestions on how to change it?  I didn't see any use of it that this wouldn't serve so far, and it's not like we can't change this after... it's internal to KJS anyway.
Comment 5 Alexey Proskuryakov 2006-03-19 06:20:56 PST
I think that a real API for tolower/toupper can only work with complete strings, not individual characters (there are some evil examples in bug 6310).

Could the following work?

1) KJS::Unicode::isFormatChar() and KJS::Unicode::isSeparatorSpace() take uint32_t parameters (complete Unicode code point in UTF-32 encoding). This won't change their existing uses, and will still be easily implementable with Qt, because format chars and spaces are all in the first Unicode plane AFAIK.
2) KJS::Unicode::toLower() and KJS::Unicode::toUpper work on unsigned short (UTF-16) buffers, with implementation details hidden behind the API. They should be able to change the length of the string.
3) UChar::toUpper() and UChar::toLower() are removed altogether (it doesn't look like they are actually used for anything but strings).

I think that this abstraction may eventially become useful in WebCore, too.
Comment 6 Maciej Stachowiak 2006-03-21 01:20:54 PST
It's nice that the API layer came out so thin.

I don't think George needs to perfect the abstraction in all ways before landing. But there are the following issues, I believe:

* The patch shows an include of one of the platform-specific headers. I think it would be better to include a single header, which uses PLATFORM macros to pick the right per-platform implementation.

+#include "unicode/qt4/kjs_unicode.h"

* This method is not implemented in the ICU code path:

inline CharCategory category(unsigned short c)

* I think it would be better to put this in kxmlcore instead of kjs, so it can be used in the future by WebCore and/or khtml.

I think enhancing the API to go beyond KJS's current uses can be done in a separate pass, but these issues seem relatively important.

r- for now because I think the included patch would break the JavaScriptCore build and we need to straighten out the header issue.
Comment 7 George Staikos 2006-04-02 22:30:42 PDT
Created attachment 7478 [details]
Updated patch
Comment 8 Darin Adler 2006-04-03 08:53:39 PDT
Comment on attachment 7478 [details]
Updated patch

+    } while (KJS::Unicode::isFormatChar(next3));

You should remove the KJS:: prefixes in all these places, because the code is already in the KJS prefix.

+      return static_cast<unsigned short>(u_tolower(c));

+      return static_cast<unsigned short>(u_toupper(c));

No need for the type casts here.
Comment 9 George Staikos 2006-04-03 09:19:13 PDT
Ok I've corrected it locally.  Any other issues?
Comment 10 Maciej Stachowiak 2006-04-03 15:54:05 PDT
Looks pretty good. A few minor comments:

- this should be in namespace KXMLCore, not namespace KJS
- ideally the headers should follow the kxmlcore naming convention, so Unicode.h, UnicodeICU.h, UnicodeQt4.h

I'll let you fix this up as part of landing, or else it can be fixed later, so r+.
Comment 11 George Staikos 2006-04-03 16:46:43 PDT
Done