WebKit Bugzilla
New
Browse
Search+
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
102996
Grapheme cluster functions can be simplified for 8 bit Strings
https://bugs.webkit.org/show_bug.cgi?id=102996
Summary
Grapheme cluster functions can be simplified for 8 bit Strings
Michael Saboff
Reported
2012-11-21 17:24:53 PST
numGraphemeClusters() and numCharactersInGraphemeClusters() currently process strings using a CharacterBreakIterator and 8 bit strings need to be up converted to 16 bits. According to the Unicode spec, the only extended grapheme cluster is a carriage return followed by a line feed. Upconverting an 8 bit string to 16 bits, then processing using a CharacterBreakIterator seems overkill. At a minimum, both functions could process 8 bit strings natively, looking for CR - LF pairs, treating them as one GraphemeCluster. Other optimizations may be possible.
Attachments
Patch
(2.74 KB, patch)
2012-11-26 17:12 PST
,
Michael Saboff
no flags
Details
Formatted Diff
Diff
View All
Add attachment
proposed patch, testcase, etc.
Alexey Proskuryakov
Comment 1
2012-11-22 23:56:06 PST
Is it actually extended grapheme clusters that we're ultimately interested in there, and not e.g. tailored grapheme clusters, like Slovak "ch"? IIRC these functions are used to implement some fuzzily defined features.
> According to the Unicode spec, the only extended grapheme cluster is a carriage return followed by a line feed.
I presume that you meant Latin-1 characters only. From a cursory glance at the spec, I'm not sure if combinations with non-breaking space U+00A0 are included.
Michael Saboff
Comment 2
2012-11-26 14:02:16 PST
(In reply to
comment #1
)
> Is it actually extended grapheme clusters that we're ultimately interested in there, and not e.g. tailored grapheme clusters, like Slovak "ch"?
I assume that we are online interested in extended grapheme clusters as that is what the ICU library claims to provide (from
http://icu-project.org/apiref/icu4c/ubrk_8h.html
in the detailed description section for BreakIterator C API): Character boundary analysis identifies the boundaries of "Extended Grapheme Clusters", which are groupings of codepoints that should be treated as character-like units for many text operations. Please see Unicode Standard Annex #29, Unicode Text Segmentation,
http://www.unicode.org/reports/tr29/
for additional information on grapheme clusters and guidelines on their use.
> IIRC these functions are used to implement some fuzzily defined features. > > > According to the Unicode spec, the only extended grapheme cluster is a carriage return followed by a line feed. > > I presume that you meant Latin-1 characters only. From a cursory glance at the spec, I'm not sure if combinations with non-breaking space U+00A0 are included.
Yes, I mean Latin-1 characters. I couldn't see any combinations with NBSP.
Alexey Proskuryakov
Comment 3
2012-11-26 14:11:15 PST
> Character boundary analysis identifies the boundaries of "Extended Grapheme Clusters", which are groupings of codepoints that should be treated as character-like units for many text operations.
Yes, this is why I'm asking. There is often additional context on the Web, such as page language, so using custom tailorings may be appropriate.
Michael Saboff
Comment 4
2012-11-26 17:12:08 PST
Created
attachment 176119
[details]
Patch After discussing with Alexey, we agreed that the current code handles Extended Grapheme Clusters and that we can simply look for the CR-LF combo. If we want to handle Tailored Graheme Clusters in the future, then this code will need to chang.
WebKit Review Bot
Comment 5
2012-11-26 19:52:17 PST
Comment on
attachment 176119
[details]
Patch Clearing flags on attachment: 176119 Committed
r135805
: <
http://trac.webkit.org/changeset/135805
>
WebKit Review Bot
Comment 6
2012-11-26 19:52:21 PST
All reviewed patches have been landed. Closing bug.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug