<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://bugs.webkit.org/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4.1"
          urlbase="https://bugs.webkit.org/"
          
          maintainer="admin@webkit.org"
>

    <bug>
          <bug_id>102996</bug_id>
          
          <creation_ts>2012-11-21 17:24:53 -0800</creation_ts>
          <short_desc>Grapheme cluster functions can be simplified for 8 bit Strings</short_desc>
          <delta_ts>2012-11-26 19:52:21 -0800</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WebKit</product>
          <component>Layout and Rendering</component>
          <version>528+ (Nightly build)</version>
          <rep_platform>All</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>Normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Michael Saboff">msaboff</reporter>
          <assigned_to name="Michael Saboff">msaboff</assigned_to>
          <cc>ap</cc>
    
    <cc>darin</cc>
    
    <cc>mitz</cc>
    
    <cc>webkit.review.bot</cc>
          

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>773443</commentid>
    <comment_count>0</comment_count>
    <who name="Michael Saboff">msaboff</who>
    <bug_when>2012-11-21 17:24:53 -0800</bug_when>
    <thetext>numGraphemeClusters() and numCharactersInGraphemeClusters() currently process strings using a CharacterBreakIterator, so 8 bit strings need to be upconverted to 16 bits first. According to the Unicode spec, the only extended grapheme cluster is a carriage return followed by a line feed.  Upconverting an 8 bit string to 16 bits, then processing it using a CharacterBreakIterator, seems like overkill.

At a minimum, both functions could process 8 bit strings natively, looking for CR-LF pairs and treating each pair as one grapheme cluster.  Other optimizations may be possible.</thetext>
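    <!--
    Editorial sketch, not part of the original comment: one way the native 8 bit
    processing proposed above might look. It assumes, as the comment does, that CR LF
    is the only multi-character cluster that can occur in Latin-1 text. std::string
    stands in for an 8 bit WebKit String, and both function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for an 8 bit (Latin-1) string: one byte per character.
using Latin1String = std::string;

// Counts clusters natively: each character is its own cluster,
// except that a CR immediately followed by LF counts as one.
std::size_t countLatin1GraphemeClusters(const Latin1String& s)
{
    std::size_t clusters = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Skip the LF of a CR LF pair so the pair is counted once.
        if (s[i] == '\r' && i + 1 < s.size() && s[i + 1] == '\n')
            ++i;
        ++clusters;
    }
    return clusters;
}

// Returns how many characters the first maxClusters clusters span,
// or the whole length if the string holds fewer clusters than that.
std::size_t charactersInLatin1GraphemeClusters(const Latin1String& s, std::size_t maxClusters)
{
    std::size_t i = 0;
    for (std::size_t n = 0; n < maxClusters && i < s.size(); ++n) {
        if (s[i] == '\r' && i + 1 < s.size() && s[i + 1] == '\n')
            ++i;
        ++i;
    }
    return i;
}
```

    Neither function upconverts or allocates; both make a single pass over the bytes.
    -->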
  </long_desc><long_desc isprivate="0" >
    <commentid>774259</commentid>
    <comment_count>1</comment_count>
    <who name="Alexey Proskuryakov">ap</who>
    <bug_when>2012-11-22 23:56:06 -0800</bug_when>
    <thetext>Is it actually extended grapheme clusters that we&apos;re ultimately interested in there, and not e.g. tailored grapheme clusters, like Slovak &quot;ch&quot;?

IIRC these functions are used to implement some fuzzily defined features.

&gt; According to the Unicode spec, the only extended grapheme cluster is a carriage return followed by a line feed.

I presume that you meant Latin-1 characters only. From a cursory glance at the spec, I&apos;m not sure if combinations with non-breaking space U+00A0 are included.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>775866</commentid>
    <comment_count>2</comment_count>
    <who name="Michael Saboff">msaboff</who>
    <bug_when>2012-11-26 14:02:16 -0800</bug_when>
    <thetext>(In reply to comment #1)
&gt; Is it actually extended grapheme clusters that we&apos;re ultimately interested in there, and not e.g. tailored grapheme clusters, like Slovak &quot;ch&quot;?

I assume that we are only interested in extended grapheme clusters as that is what the ICU library claims to provide (from http://icu-project.org/apiref/icu4c/ubrk_8h.html in the detailed description section for BreakIterator C API):

Character boundary analysis identifies the boundaries of &quot;Extended Grapheme Clusters&quot;, which are groupings of codepoints that should be treated as character-like units for many text operations. Please see Unicode Standard Annex #29, Unicode Text Segmentation, http://www.unicode.org/reports/tr29/ for additional information on grapheme clusters and guidelines on their use.


&gt; IIRC these functions are used to implement some fuzzily defined features.
&gt; 
&gt; &gt; According to the Unicode spec, the only extended grapheme cluster is a carriage return followed by a line feed.
&gt; 
&gt; I presume that you meant Latin-1 characters only. From a cursory glance at the spec, I&apos;m not sure if combinations with non-breaking space U+00A0 are included.

Yes, I mean Latin-1 characters. I couldn&apos;t see any combinations with NBSP.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>775874</commentid>
    <comment_count>3</comment_count>
    <who name="Alexey Proskuryakov">ap</who>
    <bug_when>2012-11-26 14:11:15 -0800</bug_when>
    <thetext>&gt; Character boundary analysis identifies the boundaries of &quot;Extended Grapheme Clusters&quot;, which are groupings of codepoints that should be treated as character-like units for many text operations.

Yes, this is why I&apos;m asking. There is often additional context on the Web, such as page language, so using custom tailorings may be appropriate.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>776103</commentid>
    <comment_count>4</comment_count>
      <attachid>176119</attachid>
    <who name="Michael Saboff">msaboff</who>
    <bug_when>2012-11-26 17:12:08 -0800</bug_when>
    <thetext>Created attachment 176119
Patch

After discussing with Alexey, we agreed that the current code handles Extended Grapheme Clusters and that we can simply look for the CR-LF combo.  If we want to handle Tailored Grapheme Clusters in the future, then this code will need to change.</thetext>
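    <!--
    Editorial sketch, not part of the original comment: a simplified, self-contained
    illustration of the early-out idea described above, under the assumption that a
    Latin-1 string without any CR cannot contain a multi-character cluster.
    std::string stands in for an 8 bit WebKit String, walkClusters is a hypothetical
    stand-in for the break-iterator walk, and the Sketch-suffixed names are invented
    here. The attached patch is the authoritative version.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Result of walking a string cluster by cluster.
struct Walk {
    std::size_t clusters;
    std::size_t characters;
};

// Hypothetical slow path: advances past up to maxClusters clusters,
// merging only CR LF, and reports what was consumed.
Walk walkClusters(const std::string& s, std::size_t maxClusters)
{
    Walk w{0, 0};
    while (w.clusters < maxClusters && w.characters < s.size()) {
        if (s[w.characters] == '\r' && w.characters + 1 < s.size() && s[w.characters + 1] == '\n')
            ++w.characters;
        ++w.characters;
        ++w.clusters;
    }
    return w;
}

std::size_t numGraphemeClustersSketch(const std::string& s)
{
    if (s.empty())
        return 0;
    // Fast path: no CR means no CR LF pair, so clusters equal characters.
    if (s.find('\r') == std::string::npos)
        return s.size();
    return walkClusters(s, s.size()).clusters;
}

std::size_t numCharactersInGraphemeClustersSketch(const std::string& s, std::size_t numGraphemeClusters)
{
    if (s.empty())
        return 0;
    // Fast path: with one character per cluster, just clamp to the length.
    if (s.find('\r') == std::string::npos)
        return std::min(s.size(), numGraphemeClusters);
    return walkClusters(s, numGraphemeClusters).characters;
}
```

    The common case (no CR anywhere) costs one scan of the string; only strings that
    contain a CR pay for the cluster walk.
    -->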
  </long_desc><long_desc isprivate="0" >
    <commentid>776231</commentid>
    <comment_count>5</comment_count>
      <attachid>176119</attachid>
    <who name="WebKit Review Bot">webkit.review.bot</who>
    <bug_when>2012-11-26 19:52:17 -0800</bug_when>
    <thetext>Comment on attachment 176119
Patch

Clearing flags on attachment: 176119

Committed r135805: &lt;http://trac.webkit.org/changeset/135805&gt;</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>776232</commentid>
    <comment_count>6</comment_count>
    <who name="WebKit Review Bot">webkit.review.bot</who>
    <bug_when>2012-11-26 19:52:21 -0800</bug_when>
    <thetext>All reviewed patches have been landed.  Closing bug.</thetext>
  </long_desc>
      
          <attachment
              isobsolete="0"
              ispatch="1"
              isprivate="0"
          >
            <attachid>176119</attachid>
            <date>2012-11-26 17:12:08 -0800</date>
            <delta_ts>2012-11-26 19:52:17 -0800</delta_ts>
            <desc>Patch</desc>
            <filename>102996.patch</filename>
            <type>text/plain</type>
            <size>2804</size>
            <attacher name="Michael Saboff">msaboff</attacher>
            
              <data encoding="base64">SW5kZXg6IFNvdXJjZS9XZWJDb3JlL0NoYW5nZUxvZwo9PT09PT09PT09PT09PT09PT09PT09PT09
PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09Ci0tLSBTb3VyY2UvV2Vi
Q29yZS9DaGFuZ2VMb2cJKHJldmlzaW9uIDEzNTc4OCkKKysrIFNvdXJjZS9XZWJDb3JlL0NoYW5n
ZUxvZwkod29ya2luZyBjb3B5KQpAQCAtMSwzICsxLDIwIEBACisyMDEyLTExLTI2ICBNaWNoYWVs
IFNhYm9mZiAgPG1zYWJvZmZAYXBwbGUuY29tPgorCisgICAgICAgIEdyYXBoZW1lIGNsdXN0ZXIg
ZnVuY3Rpb25zIGNhbiBiZSBzaW1wbGlmaWVkIGZvciA4IGJpdCBTdHJpbmdzCisgICAgICAgIGh0
dHBzOi8vYnVncy53ZWJraXQub3JnL3Nob3dfYnVnLmNnaT9pZD0xMDI5OTYKKworICAgICAgICBS
ZXZpZXdlZCBieSBOT0JPRFkgKE9PUFMhKS4KKworICAgICAgICBGb3IgOCBiaXQgc3RyaW5ncywg
Y2hlY2sgZm9yIHRoZSB1bmNvbW1vbiBDUi1MRiBieSBsb29raW5nIGZvciBhbnkgQ1IuICBJZiB0
aGVyZSBhcmVuJ3QgYW55IENSIGNoYXJhY3RlcnMsCisgICAgICAgIHRoZSBudW1iZXIgb2YgRXh0
ZW5kZWQgR3JhcGhlbWUgQ2x1c3RlcnMgaXMgZXF1YWwgdG8gdGhlIHN0cmluZyBsZW5ndGguICBJ
ZiB3ZSBuZWVkIHRvIGhhbmRsZSBUYWlsb3JlZAorICAgICAgICBHcmFoZW1lIENsdXN0ZXJzLCB0
aGVuIHRoaXMgd2lsbCBuZWVkIHRvIGNoYW5nZS4KKworICAgICAgICBObyBuZXcgdGVzdHMuIE5v
IGNoYW5nZSBpbiBmdW5jdGlvbmFsaXR5LgorCisgICAgICAgICogcGxhdGZvcm0vdGV4dC9UZXh0
QnJlYWtJdGVyYXRvci5jcHA6CisgICAgICAgIChXZWJDb3JlOjpudW1HcmFwaGVtZUNsdXN0ZXJz
KToKKyAgICAgICAgKFdlYkNvcmU6Om51bUNoYXJhY3RlcnNJbkdyYXBoZW1lQ2x1c3RlcnMpOgor
CiAyMDEyLTExLTI2ICBBbmRyZWFzIEtsaW5nICA8YWtsaW5nQGFwcGxlLmNvbT4KIAogICAgICAg
ICBSZW5kZXJTdHlsZTogTW92ZSAnbGlzdC1zdHlsZS1pbWFnZScgdG8gcmFyZSBpbmhlcml0ZWQg
ZGF0YS4KSW5kZXg6IFNvdXJjZS9XZWJDb3JlL3BsYXRmb3JtL3RleHQvVGV4dEJyZWFrSXRlcmF0
b3IuY3BwCj09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09
PT09PT09PT09PT09PT09PT0KLS0tIFNvdXJjZS9XZWJDb3JlL3BsYXRmb3JtL3RleHQvVGV4dEJy
ZWFrSXRlcmF0b3IuY3BwCShyZXZpc2lvbiAxMzU2NDApCisrKyBTb3VyY2UvV2ViQ29yZS9wbGF0
Zm9ybS90ZXh0L1RleHRCcmVha0l0ZXJhdG9yLmNwcAkod29ya2luZyBjb3B5KQpAQCAtMjYsOSAr
MjYsMTggQEAgbmFtZXNwYWNlIFdlYkNvcmUgewogCiB1bnNpZ25lZCBudW1HcmFwaGVtZUNsdXN0
ZXJzKGNvbnN0IFN0cmluZyYgcykKIHsKLSAgICBOb25TaGFyZWRDaGFyYWN0ZXJCcmVha0l0ZXJh
dG9yIGl0KHMuY2hhcmFjdGVycygpLCBzLmxlbmd0aCgpKTsKKyAgICB1bnNpZ25lZCBzdHJpbmdM
ZW5ndGggPSBzLmxlbmd0aCgpOworICAgIAorICAgIGlmICghc3RyaW5nTGVuZ3RoKQorICAgICAg
ICByZXR1cm4gMDsKKworICAgIC8vIFRoZSBvbmx5IExhdGluLTEgRXh0ZW5kZWQgR3JhcGhlbWUg
Q2x1c3RlciBpcyBDUiBMRgorICAgIGlmIChzLmlzOEJpdCgpICYmICFzLmNvbnRhaW5zKCdccicp
KQorICAgICAgICByZXR1cm4gc3RyaW5nTGVuZ3RoOworCisgICAgTm9uU2hhcmVkQ2hhcmFjdGVy
QnJlYWtJdGVyYXRvciBpdChzLmNoYXJhY3RlcnMoKSwgc3RyaW5nTGVuZ3RoKTsKICAgICBpZiAo
IWl0KQotICAgICAgICByZXR1cm4gcy5sZW5ndGgoKTsKKyAgICAgICAgcmV0dXJuIHN0cmluZ0xl
bmd0aDsKIAogICAgIHVuc2lnbmVkIG51bSA9IDA7CiAgICAgd2hpbGUgKHRleHRCcmVha05leHQo
aXQpICE9IFRleHRCcmVha0RvbmUpCkBAIC0zOCwxMyArNDcsMjIgQEAgdW5zaWduZWQgbnVtR3Jh
cGhlbWVDbHVzdGVycyhjb25zdCBTdHJpbgogCiB1bnNpZ25lZCBudW1DaGFyYWN0ZXJzSW5HcmFw
aGVtZUNsdXN0ZXJzKGNvbnN0IFN0cmluZyYgcywgdW5zaWduZWQgbnVtR3JhcGhlbWVDbHVzdGVy
cykKIHsKLSAgICBOb25TaGFyZWRDaGFyYWN0ZXJCcmVha0l0ZXJhdG9yIGl0KHMuY2hhcmFjdGVy
cygpLCBzLmxlbmd0aCgpKTsKKyAgICB1bnNpZ25lZCBzdHJpbmdMZW5ndGggPSBzLmxlbmd0aCgp
OworCisgICAgaWYgKCFzdHJpbmdMZW5ndGgpCisgICAgICAgIHJldHVybiAwOworCisgICAgLy8g
VGhlIG9ubHkgTGF0aW4tMSBFeHRlbmRlZCBHcmFwaGVtZSBDbHVzdGVyIGlzIENSIExGCisgICAg
aWYgKHMuaXM4Qml0KCkgJiYgIXMuY29udGFpbnMoJ1xyJykpCisgICAgICAgIHJldHVybiBzdGQ6
Om1pbihzdHJpbmdMZW5ndGgsIG51bUdyYXBoZW1lQ2x1c3RlcnMpOworCisgICAgTm9uU2hhcmVk
Q2hhcmFjdGVyQnJlYWtJdGVyYXRvciBpdChzLmNoYXJhY3RlcnMoKSwgc3RyaW5nTGVuZ3RoKTsK
ICAgICBpZiAoIWl0KQotICAgICAgICByZXR1cm4gc3RkOjptaW4ocy5sZW5ndGgoKSwgbnVtR3Jh
cGhlbWVDbHVzdGVycyk7CisgICAgICAgIHJldHVybiBzdGQ6Om1pbihzdHJpbmdMZW5ndGgsIG51
bUdyYXBoZW1lQ2x1c3RlcnMpOwogCiAgICAgZm9yICh1bnNpZ25lZCBpID0gMDsgaSA8IG51bUdy
YXBoZW1lQ2x1c3RlcnM7ICsraSkgewogICAgICAgICBpZiAodGV4dEJyZWFrTmV4dChpdCkgPT0g
VGV4dEJyZWFrRG9uZSkKLSAgICAgICAgICAgIHJldHVybiBzLmxlbmd0aCgpOworICAgICAgICAg
ICAgcmV0dXJuIHN0cmluZ0xlbmd0aDsKICAgICB9CiAgICAgcmV0dXJuIHRleHRCcmVha0N1cnJl
bnQoaXQpOwogfQo=
</data>

          </attachment>
      

    </bug>

</bugzilla>