Bug 48227

Summary:

[GTK] Handle surrogate pairs in TextBreakIteratorGtk

Product:

WebKit

Reporter:

Carlos Garcia Campos <cgarcia>

Component:

WebKitGTK

Assignee:

Nobody <webkit-unassigned>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

abarth, commit-queue, eric, mrobinson, webkit.review.bot

Priority:

Version:

528+ (Nightly build)

Hardware:

OS:

Linux

Bug Depends on:

Bug Blocks:

34247

Attachments:

Description	Flags
Patch to handle surrogate pairs in TextBreakIteratorGtk	mrobinson: review-
New patch to handle surrogate pairs	none
Previous patch with ChangeLog fixed	mrobinson: review-
Updated patch according to review	none

Carlos Garcia Campos

Reported 2010-10-25 03:58:39 PDT

Strings with surrogate pairs are not correctly handled when using the glib unicode backend.

Attachments
Patch to handle surrogate pairs in TextBreakIteratorGtk (7.98 KB, patch) 2010-10-25 04:03 PDT, Carlos Garcia Campos	mrobinson: review-
New patch to handle surrogate pairs (13.74 KB, patch) 2010-10-27 05:40 PDT, Carlos Garcia Campos	no flags
Previous patch with ChangeLog fixed (14.42 KB, patch) 2010-10-27 05:45 PDT, Carlos Garcia Campos	mrobinson: review-
Updated patch according to review (14.42 KB, patch) 2010-10-28 05:10 PDT, Carlos Garcia Campos	no flags

Carlos Garcia Campos

Comment 1 2010-10-25 04:03:51 PDT

Created attachment 71728 [details]
Patch to handle surrogate pairs in TextBreakIteratorGtk

TextBreakIteratorGtk uses UTF-8 because that is what Pango expects, but we need to return indices into the given input string, which is in UTF-16. The number of characters in the UTF-8 string equals the number of UTF-16 code units in the input, except when the input string contains surrogate pairs. We need to keep both indices: the index into the UTF-8 string, used internally, and the index into the UTF-16 string, used as the return value of the iterator interface.

This fixes the test fast/forms/textarea-maxlength.html.
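To illustrate the index mismatch described above, here is a minimal sketch that maps a character offset in a UTF-8 string to the corresponding UTF-16 index. It is not code from the patch: the helper names decodeUtf8 and utf16IndexForCharacterOffset are hypothetical, the input is assumed to be well-formed UTF-8, and plain std::string is used in place of the GLib types so the example stands alone.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the patch): decodes the code point that starts
// at byte `pos` of a well-formed UTF-8 string and advances `pos` past it.
static char32_t decodeUtf8(const std::string& utf8, size_t& pos)
{
    unsigned char lead = static_cast<unsigned char>(utf8[pos++]);
    int trailing = 0;
    char32_t codePoint = lead;
    if (lead >= 0xF0) {
        trailing = 3;
        codePoint = lead & 0x07;
    } else if (lead >= 0xE0) {
        trailing = 2;
        codePoint = lead & 0x0F;
    } else if (lead >= 0xC0) {
        trailing = 1;
        codePoint = lead & 0x1F;
    }
    while (trailing-- > 0 && pos < utf8.size())
        codePoint = (codePoint << 6) | (static_cast<unsigned char>(utf8[pos++]) & 0x3F);
    return codePoint;
}

// Hypothetical helper (not from the patch): converts a character offset, as
// used with the UTF-8 string, into an index into the UTF-16 input string.
static int utf16IndexForCharacterOffset(const std::string& utf8, int characterOffset)
{
    size_t pos = 0;
    int utf16Index = 0;
    for (int i = 0; i < characterOffset && pos < utf8.size(); ++i) {
        // Code points above U+FFFF need a surrogate pair, i.e. two UTF-16 code units.
        utf16Index += decodeUtf8(utf8, pos) > 0xFFFF ? 2 : 1;
    }
    return utf16Index;
}
```

For a string such as "a", U+1F600, "b", character offset 2 (the "b") corresponds to UTF-16 index 3, because U+1F600 occupies two UTF-16 code units.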

Martin Robinson

Comment 2 2010-10-25 11:36:49 PDT

Comment on attachment 71728 [details]
Patch to handle surrogate pairs in TextBreakIteratorGtk

View in context: https://bugs.webkit.org/attachment.cgi?id=71728&action=review

Looking good, but here are some suggestions.

> WebCore/ChangeLog:9
> + need to return indices for the given input string that it's in utf16.

I think this should be "that it's in" --> "that are in"

Should be 'UTF-8' and 'UTF-16' throughout.

> WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:42
> + GOwnPtr<char> m_utf8;

This should probably be a CString.

> WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:115
> +static int characterSize(const gchar* str, glong offset)

'str' should be 'string'

> WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:119
> + gchar* p = g_utf8_offset_to_pointer(str, offset);
> + gunichar c = g_utf8_get_char(p);
> + return (c >= 0x10000 && c <= 0x10FFFF) ? 2 : 1;

Avoid using one letter variable names. These should be full words. Can we use any of the stuff from JavaScriptCore/wtf/unicode/UnicodeMacrosFromICU.h to remove these magic numbers?

> WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:122
> +static int nextUtf16Step(TextBreakIterator* bi, int i)

Again avoid abbreviations here.

> WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:130
> +static int previousUtf16Step(TextBreakIterator* bi, int i)

Ditto.

> WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:140
> +static int getUTF8Index(TextBreakIterator* bi, int i)

Ditto.

> WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:159
> int textBreakFirst(TextBreakIterator* bi)

Ditto.

> WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:166
> int textBreakLast(TextBreakIterator* bi)

Ditto.

> WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:188
> + if ((whiteSpaceAtTheEnd = bi->m_logAttrs[pos].is_white)) {

Why the extra parenthesis here?

> WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:201
> int textBreakNext(TextBreakIterator* bi)

Ditto.

> WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:223
> int textBreakPrevious(TextBreakIterator* bi)

Ditto, etc.
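As an illustration of the review point about magic numbers, the sketch below rewrites the quoted expression with named constants and full-word identifiers. The names are hypothetical and are not taken from the patch or from UnicodeMacrosFromICU.h; ICU itself provides a U16_LENGTH macro for the same purpose, though whether the WebKit header exposed it at the time is an assumption not confirmed by this thread.

```cpp
#include <cstdint>

// Hypothetical names, chosen for illustration only.
constexpr uint32_t lastBasicMultilingualPlaneCodePoint = 0xFFFF;
constexpr uint32_t lastUnicodeCodePoint = 0x10FFFF;

// Same result as the reviewed expression `(c >= 0x10000 && c <= 0x10FFFF) ? 2 : 1`:
// supplementary code points take two UTF-16 code units, everything else one.
constexpr int utf16CodeUnitCount(uint32_t character)
{
    return (character > lastBasicMultilingualPlaneCodePoint && character <= lastUnicodeCodePoint) ? 2 : 1;
}
```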

Carlos Garcia Campos

Comment 3 2010-10-27 05:40:30 PDT

Created attachment 72019 [details]
New patch to handle surrogate pairs

This patch is not just an update of the previous one. To simplify the code, I've added a new class, CharacterIterator, that takes care of the UTF-8 and UTF-16 indices. The break iterator uses this class to iterate over the input string. I've also fixed all the coding style issues in the file.
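The idea of one object tracking both indices can be sketched as follows. This is an illustration only, not the CharacterIterator from the patch: the class name DualIndexCursor and its methods are hypothetical, the input is assumed to be well-formed UTF-8, and std::string replaces the GLib types so the example is self-contained. It relies on the fact that a 4-byte UTF-8 sequence always encodes a code point above U+FFFF, which takes two UTF-16 code units.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: walks a well-formed UTF-8 string one character at a
// time while keeping the character index and the UTF-16 index in sync.
class DualIndexCursor {
public:
    explicit DualIndexCursor(const std::string& utf8) : m_utf8(utf8) { }

    int characterIndex() const { return m_characterIndex; }
    int utf16Index() const { return m_utf16Index; }
    bool atEnd() const { return m_byteOffset >= m_utf8.size(); }

    // Moves one character forward; returns false if already at the end.
    bool next()
    {
        if (atEnd())
            return false;
        unsigned char lead = static_cast<unsigned char>(m_utf8[m_byteOffset]);
        size_t length = sequenceLength(lead);
        size_t remaining = m_utf8.size() - m_byteOffset;
        m_byteOffset += length < remaining ? length : remaining;
        m_utf16Index += utf16Units(lead);
        ++m_characterIndex;
        return true;
    }

    // Moves one character backward; returns false if already at the start.
    bool previous()
    {
        if (!m_byteOffset)
            return false;
        // Skip back over continuation bytes (10xxxxxx) to reach the lead byte.
        do {
            --m_byteOffset;
        } while (m_byteOffset > 0 && (static_cast<unsigned char>(m_utf8[m_byteOffset]) & 0xC0) == 0x80);
        m_utf16Index -= utf16Units(static_cast<unsigned char>(m_utf8[m_byteOffset]));
        --m_characterIndex;
        return true;
    }

private:
    static size_t sequenceLength(unsigned char lead)
    {
        if (lead < 0x80)
            return 1;
        if (lead < 0xE0)
            return 2;
        if (lead < 0xF0)
            return 3;
        return 4;
    }

    // A 4-byte sequence (lead byte >= 0xF0) maps to a surrogate pair in UTF-16.
    static int utf16Units(unsigned char lead) { return lead >= 0xF0 ? 2 : 1; }

    std::string m_utf8;
    size_t m_byteOffset { 0 };
    int m_characterIndex { 0 };
    int m_utf16Index { 0 };
};
```

A break iterator built on such a cursor could use characterIndex() for lookups in per-character data and utf16Index() for the values it returns to callers.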

Carlos Garcia Campos

Comment 4 2010-10-27 05:45:08 PDT

Created attachment 72020 [details]
Previous patch with ChangeLog fixed

Sorry, I made a mistake generating the ChangeLog in the previous patch.

Martin Robinson

Comment 5 2010-10-28 00:07:34 PDT

Comment on attachment 72020 [details]
Previous patch with ChangeLog fixed

View in context: https://bugs.webkit.org/attachment.cgi?id=72020&action=review

Very nice! I just have a couple very small comments.

> WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:31
> +using namespace std;

I think I'd prefer the functions you're using to be explicit here. using std::max; for example.

> WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:33
> +#define IS_SURROGATE(character) (character >= 0x10000 && character <= 0x10FFFF)

I think this should probably be called UTF8_IS_SURROGATE just to be clear that it deals with UTF-8 characters.

> WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:77
> + long utf8len = 0;

According to the style guidelines, this should be utf8Length.

Martin Robinson

Comment 6 2010-10-28 00:18:35 PDT

(In reply to comment #5)
> I think I'd prefer the functions you're using to be explicit here.
> using std::max; for example.

Carlos pointed out that I'm wrong about this. The style guidelines are explicit that "using namespace std;" is preferred.

Carlos Garcia Campos

Comment 7 2010-10-28 05:10:41 PDT

Created attachment 72171 [details] Updated patch according to review

Martin Robinson

Comment 8 2010-10-28 09:24:00 PDT

Comment on attachment 72171 [details] Updated patch according to review Thanks!

WebKit Commit Bot

Comment 9 2010-10-29 08:39:14 PDT

Comment on attachment 72171 [details]
Updated patch according to review

Clearing flags on attachment: 72171

Committed r70881: <http://trac.webkit.org/changeset/70881>

WebKit Commit Bot

Comment 10 2010-10-29 08:39:21 PDT

All reviewed patches have been landed. Closing bug.

WebKit Review Bot

Comment 11 2010-10-29 09:58:52 PDT

http://trac.webkit.org/changeset/70881 might have broken Leopard Intel Release (Tests)
The following tests are not passing:
fast/workers/storage/interrupt-database-sync.html
