102996 – Grapheme cluster functions can be simplified for 8 bit Strings

https://bugs.webkit.org/show_bug.cgi?id=102996

Michael Saboff

Reported 2012-11-21 17:24:53 PST

numGraphemeClusters() and numCharactersInGraphemeClusters() currently process strings using a CharacterBreakIterator, so 8-bit strings need to be upconverted to 16 bits. According to the Unicode spec, the only extended grapheme cluster is a carriage return followed by a line feed. Upconverting an 8-bit string to 16 bits and then processing it with a CharacterBreakIterator seems like overkill. At a minimum, both functions could process 8-bit strings natively, looking for CR-LF pairs and treating each pair as one grapheme cluster. Other optimizations may be possible.
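As a rough illustration of the native 8-bit path described above, the following sketch counts grapheme clusters in a Latin-1 buffer by treating each CR-LF pair as one cluster and every other byte as its own cluster. The helper name `countGraphemeClustersLatin1` and its raw-pointer signature are assumptions made for this example; this is not the code from the attached patch, and it relies on the premise stated in this report that CR-LF is the only multi-character extended grapheme cluster in Latin-1.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not the committed WebKit code): counts grapheme
// clusters in a Latin-1 (8-bit) buffer without upconverting to 16 bits.
// Assumption: CR LF is the only multi-character cluster in Latin-1.
std::size_t countGraphemeClustersLatin1(const unsigned char* data, std::size_t length)
{
    assert(data || !length);
    std::size_t clusters = 0;
    for (std::size_t i = 0; i < length; ++i) {
        // A CR immediately followed by an LF forms one cluster, so skip the LF.
        if (data[i] == '\r' && i + 1 < length && data[i + 1] == '\n')
            ++i;
        ++clusters;
    }
    return clusters;
}
```

A lone CR or a lone LF still counts as one cluster each; only the exact sequence CR followed by LF is merged.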

Attachments
Patch (2.74 KB, patch) 2012-11-26 17:12 PST, Michael Saboff, no flags

Alexey Proskuryakov

Comment 1 2012-11-22 23:56:06 PST

Is it actually extended grapheme clusters that we're ultimately interested in there, and not e.g. tailored grapheme clusters, like Slovak "ch"?

IIRC these functions are used to implement some fuzzily defined features.

> According to the Unicode spec, the only extended grapheme cluster is a carriage return followed by a line feed.

I presume that you meant Latin-1 characters only. From a cursory glance at the spec, I'm not sure if combinations with non-breaking space U+00A0 are included.

Michael Saboff

Comment 2 2012-11-26 14:02:16 PST

(In reply to comment #1)
> Is it actually extended grapheme clusters that we're ultimately interested in there, and not e.g. tailored grapheme clusters, like Slovak "ch"?

I assume that we are only interested in extended grapheme clusters, as that is what the ICU library claims to provide (from http://icu-project.org/apiref/icu4c/ubrk_8h.html, in the detailed description section for the BreakIterator C API):

Character boundary analysis identifies the boundaries of "Extended Grapheme Clusters", which are groupings of codepoints that should be treated as character-like units for many text operations. Please see Unicode Standard Annex #29, Unicode Text Segmentation, http://www.unicode.org/reports/tr29/ for additional information on grapheme clusters and guidelines on their use.

> IIRC these functions are used to implement some fuzzily defined features.
>
> > According to the Unicode spec, the only extended grapheme cluster is a carriage return followed by a line feed.
>
> I presume that you meant Latin-1 characters only. From a cursory glance at the spec, I'm not sure if combinations with non-breaking space U+00A0 are included.

Yes, I mean Latin-1 characters. I couldn't see any combinations with NBSP.

Alexey Proskuryakov

Comment 3 2012-11-26 14:11:15 PST

> Character boundary analysis identifies the boundaries of "Extended Grapheme Clusters", which are groupings of codepoints that should be treated as character-like units for many text operations.

Yes, this is why I'm asking. There is often additional context on the Web, such as page language, so using custom tailorings may be appropriate.

Michael Saboff

Comment 4 2012-11-26 17:12:08 PST

Created attachment 176119
Patch

After discussing with Alexey, we agreed that the current code handles Extended Grapheme Clusters and that we can simply look for the CR-LF combo. If we want to handle Tailored Grapheme Clusters in the future, then this code will need to change.
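For readers following the agreed approach, here is a minimal sketch of how the CR-LF scan might apply to the second function, which maps a number of grapheme clusters to the number of 8-bit characters they span. The name `charactersInGraphemeClustersLatin1` and its signature are hypothetical, chosen for this example; the attached patch may be structured differently. The sketch assumes Extended Grapheme Clusters only, so tailored clusters are not handled.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not the committed WebKit code): returns how many
// 8-bit characters are covered by the first numClusters grapheme clusters.
// Assumption: CR LF is the only multi-character cluster in Latin-1.
std::size_t charactersInGraphemeClustersLatin1(const unsigned char* data, std::size_t length, std::size_t numClusters)
{
    assert(data || !length);
    std::size_t i = 0;
    while (i < length && numClusters) {
        // Consume both characters of a CR LF pair as a single cluster.
        if (data[i] == '\r' && i + 1 < length && data[i + 1] == '\n')
            ++i;
        ++i;
        --numClusters;
    }
    // If the buffer holds fewer clusters than requested, this is length.
    return i;
}
```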

WebKit Review Bot

Comment 5 2012-11-26 19:52:17 PST

Comment on attachment 176119
Patch

Clearing flags on attachment: 176119

Committed r135805: <http://trac.webkit.org/changeset/135805>

WebKit Review Bot

Comment 6 2012-11-26 19:52:21 PST

All reviewed patches have been landed. Closing bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware All

OS All

Product WebKit

Component Layout and Rendering

Assignee Michael Saboff

Reported 2012-11-21 17:24 PST

Modified 2012-11-26 19:52 PST
