numGraphemeClusters() and numCharactersInGraphemeClusters() currently process strings using a CharacterBreakIterator and 8 bit strings need to be up converted to 16 bits. According to the Unicode spec, the only extended grapheme cluster is a carriage return followed by a line feed. Upconverting an 8 bit string to 16 bits, then processing using a CharacterBreakIterator seems overkill. At a minimum, both functions could process 8 bit strings natively, looking for CR - LF pairs, treating them as one GraphemeCluster. Other optimizations may be possible.
Is it actually extended grapheme clusters that we're ultimately interested in there, and not e.g. tailored grapheme clusters, like Slovak "ch"? IIRC these functions are used to implement some fuzzily defined features. > According to the Unicode spec, the only extended grapheme cluster is a carriage return followed by a line feed. I presume that you meant Latin-1 characters only. From a cursory glance at the spec, I'm not sure if combinations with non-breaking space U+00A0 are included.
(In reply to comment #1) > Is it actually extended grapheme clusters that we're ultimately interested in there, and not e.g. tailored grapheme clusters, like Slovak "ch"? I assume that we are online interested in extended grapheme clusters as that is what the ICU library claims to provide (from http://icu-project.org/apiref/icu4c/ubrk_8h.html in the detailed description section for BreakIterator C API): Character boundary analysis identifies the boundaries of "Extended Grapheme Clusters", which are groupings of codepoints that should be treated as character-like units for many text operations. Please see Unicode Standard Annex #29, Unicode Text Segmentation, http://www.unicode.org/reports/tr29/ for additional information on grapheme clusters and guidelines on their use. > IIRC these functions are used to implement some fuzzily defined features. > > > According to the Unicode spec, the only extended grapheme cluster is a carriage return followed by a line feed. > > I presume that you meant Latin-1 characters only. From a cursory glance at the spec, I'm not sure if combinations with non-breaking space U+00A0 are included. Yes, I mean Latin-1 characters. I couldn't see any combinations with NBSP.
> Character boundary analysis identifies the boundaries of "Extended Grapheme Clusters", which are groupings of codepoints that should be treated as character-like units for many text operations. Yes, this is why I'm asking. There is often additional context on the Web, such as page language, so using custom tailorings may be appropriate.
Created attachment 176119 [details] Patch After discussing with Alexey, we agreed that the current code handles Extended Grapheme Clusters and that we can simply look for the CR-LF combo. If we want to handle Tailored Graheme Clusters in the future, then this code will need to chang.
Comment on attachment 176119 [details] Patch Clearing flags on attachment: 176119 Committed r135805: <http://trac.webkit.org/changeset/135805>
All reviewed patches have been landed. Closing bug.