Bug 102996 - Grapheme cluster functions can be simplified for 8 bit Strings
Summary: Grapheme cluster functions can be simplified for 8 bit Strings
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Layout and Rendering (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Michael Saboff
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-11-21 17:24 PST by Michael Saboff
Modified: 2012-11-26 19:52 PST (History)
4 users (show)

See Also:


Attachments
Patch (2.74 KB, patch)
2012-11-26 17:12 PST, Michael Saboff
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Saboff 2012-11-21 17:24:53 PST
numGraphemeClusters() and numCharactersInGraphemeClusters() currently process strings using a CharacterBreakIterator and 8 bit strings need to be up converted to 16 bits. According to the Unicode spec, the only extended grapheme cluster is a carriage return followed by a line feed.  Upconverting an 8 bit string to 16 bits, then processing using a CharacterBreakIterator seems overkill.

At a minimum, both functions could process 8 bit strings natively, looking for CR - LF pairs, treating them as one GraphemeCluster.  Other optimizations may be possible.
Comment 1 Alexey Proskuryakov 2012-11-22 23:56:06 PST
Is it actually extended grapheme clusters that we're ultimately interested in there, and not e.g. tailored grapheme clusters, like Slovak "ch"?

IIRC these functions are used to implement some fuzzily defined features.

> According to the Unicode spec, the only extended grapheme cluster is a carriage return followed by a line feed.

I presume that you meant Latin-1 characters only. From a cursory glance at the spec, I'm not sure if combinations with non-breaking space U+00A0 are included.
Comment 2 Michael Saboff 2012-11-26 14:02:16 PST
(In reply to comment #1)
> Is it actually extended grapheme clusters that we're ultimately interested in there, and not e.g. tailored grapheme clusters, like Slovak "ch"?

I assume that we are online interested in extended grapheme clusters as that is what the ICU library claims to provide (from http://icu-project.org/apiref/icu4c/ubrk_8h.html in the detailed description section for BreakIterator C API):

Character boundary analysis identifies the boundaries of "Extended Grapheme Clusters", which are groupings of codepoints that should be treated as character-like units for many text operations. Please see Unicode Standard Annex #29, Unicode Text Segmentation, http://www.unicode.org/reports/tr29/ for additional information on grapheme clusters and guidelines on their use.


> IIRC these functions are used to implement some fuzzily defined features.
> 
> > According to the Unicode spec, the only extended grapheme cluster is a carriage return followed by a line feed.
> 
> I presume that you meant Latin-1 characters only. From a cursory glance at the spec, I'm not sure if combinations with non-breaking space U+00A0 are included.

Yes, I mean Latin-1 characters. I couldn't see any combinations with NBSP.
Comment 3 Alexey Proskuryakov 2012-11-26 14:11:15 PST
> Character boundary analysis identifies the boundaries of "Extended Grapheme Clusters", which are groupings of codepoints that should be treated as character-like units for many text operations.

Yes, this is why I'm asking. There is often additional context on the Web, such as page language, so using custom tailorings may be appropriate.
Comment 4 Michael Saboff 2012-11-26 17:12:08 PST
Created attachment 176119 [details]
Patch

After discussing with Alexey, we agreed that the current code handles Extended Grapheme Clusters and that we can simply look for the CR-LF combo.  If we want to handle Tailored Graheme Clusters in the future, then this code will need to chang.
Comment 5 WebKit Review Bot 2012-11-26 19:52:17 PST
Comment on attachment 176119 [details]
Patch

Clearing flags on attachment: 176119

Committed r135805: <http://trac.webkit.org/changeset/135805>
Comment 6 WebKit Review Bot 2012-11-26 19:52:21 PST
All reviewed patches have been landed.  Closing bug.