Bug 44790

Summary:

Move UTF16 LEAD/TRAIL logic into the HTMLEntityParser

Product:

WebKit

Reporter:

Adam Barth <abarth>

Component:

New Bugs

Assignee:

Adam Barth <abarth>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

abarth, commit-queue, eric, webkit.review.bot

Priority:

Version:

528+ (Nightly build)

Hardware:

Other

OS:

OS X 10.5

Attachments:

Description	Flags
Patch	none
Patch for landing	none

Adam Barth

Reported 2010-08-27 13:12:52 PDT

Move UTF16 LEAD/TRAIL logic into the HTMLEntityParser

Attachments
Patch (13.73 KB, patch) 2010-08-27 13:14 PDT, Adam Barth	no flags	Details Formatted Diff Diff
Patch for landing (14.52 KB, patch) 2010-08-29 23:12 PDT, Adam Barth	no flags	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Adam Barth

Comment 1 2010-08-27 13:14:55 PDT

Created attachment 65756 [details] Patch

Eric Seidel (no email)

Comment 2 2010-08-27 14:34:34 PDT

Comment on attachment 65756 [details] Patch WebCore/html/HTMLEntityParser.cpp:65 + inline bool convertToCharacters(unsigned value, Vector<UChar>& decodedEntity) I'm confused why this returns a bool if it's always true. WebCore/html/HTMLEntityParser.h:34 + bool consumeHTMLEntity(SegmentedString&, Vector<UChar>& decodedEntity, bool& notEnoughCharacters, UChar additionalAllowedCharacter = '\0'); This is always at most 2 chars. Vector seems like overkill. WebCore/html/HTMLTokenizer.cpp:122 + Vector<UChar> decodedEntity; Inline capacity 2? Using a vector here feels strange. It makes sense in your possible future world if many-char entities, but not much for when we at most produce 2 utf16 code points.

Adam Barth

Comment 3 2010-08-27 14:46:46 PDT

> I'm confused why this returns a bool if it's always true. Just to make the callsites look snazzy. We always want to return true after calling this function. > This is always at most 2 chars. Vector seems like overkill. Secretly I want to prepare for entities that expand to more than two characters. > WebCore/html/HTMLTokenizer.cpp:122 > + Vector<UChar> decodedEntity; > Inline capacity 2? Sure. > Using a vector here feels strange. It makes sense in your possible future world if many-char entities, but not much for when we at most produce 2 utf16 code points. Would you prefer UChar decodedEntity[2] ? Generally, I like Vector because it makes sure we have memory safety.

Darin Adler

Comment 4 2010-08-28 16:10:54 PDT

Comment on attachment 65756 [details] Patch > + * html/HTMLEntityParser.cpp: > + (WebCore::): > + (WebCore::consumeHTMLEntity): This list of function names is a bit strange. As you know, I am a big fan of per-function change log comments, but just removing the bogus function name might be enough. For some reason the new function's name is not here either. > +inline bool convertToCharacters(unsigned value, Vector<UChar>& decodedEntity) Functions defined and used within the same file should be marked static so they get internal linkage, unless this is inside an unnamed namespace. We have been using static rather than unnamed namespaces because tool support for unnamed namespaces was lacking, but perhaps you decided on the other path for this code? The argument name "value" here is strange. We have a type UChar32 that we often use instead of unsigned for such things, by the way. The name convertToCharacters is a bit unclear since this converts a Unicode character into a sequence of UTF-16 code units. It actually converts *from* a character. I think you should use a different name. > + if (value < 0xFFFF) { Why < here instead of <=? The old code used "< 0xFFFF" in one place and "> 0xFFFF" in another, which is inconsistent. ICU has the macro U_IS_BMP for this operation, which I think is slightly more explicit than doing the math directly. > + Vector<UChar> decodedEntity; > + bool success = consumeHTMLEntity(source, decodedEntity, notEnoughCharacters); I don't know how hot this code path is, but if it's hot enough, we might want to use Vector<UChar, 16> for this instead of Vector<UChar> to avoid a memory allocation each time. r=me

Adam Barth

Comment 5 2010-08-28 22:27:47 PDT

Thanks for the comments. > > +inline bool convertToCharacters(unsigned value, Vector<UChar>& decodedEntity) > > Functions defined and used within the same file should be marked static so they get internal linkage, unless this is inside an unnamed namespace. We have been using static rather than unnamed namespaces because tool support for unnamed namespaces was lacking, but perhaps you decided on the other path for this code? Ah, I was under the impression that inline implies static. I thought that's what one of the compiler gurus told me, but http://gcc.gnu.org/onlinedocs/gcc/Inline.html seems to imply that static is not redundant. In this case, the code is inside an anonymous namespace. I didn't know there were tooling issues with anonymous namespaces. Maybe those tooling issues are out of date since we use anonymous namespaces in a bunch of places?

Adam Barth

Comment 6 2010-08-29 23:11:12 PDT

Ok. I think I've addressed all your comments. Let's see if the bot likes the patch.

Adam Barth

Comment 7 2010-08-29 23:12:31 PDT

Created attachment 65879 [details] Patch for landing

WebKit Commit Bot

Comment 8 2010-08-29 23:31:36 PDT

Comment on attachment 65879 [details] Patch for landing Clearing flags on attachment: 65879 Committed r66359: <http://trac.webkit.org/changeset/66359>

WebKit Commit Bot

Comment 9 2010-08-29 23:31:41 PDT

All reviewed patches have been landed. Closing bug.

WebKit Review Bot

Comment 10 2010-08-29 23:37:12 PDT

http://trac.webkit.org/changeset/66359 might have broken Qt Linux Release minimal, Qt Linux ARMv5 Release, and Qt Linux ARMv7 Release The following changes are on the blame list: http://trac.webkit.org/changeset/66360 http://trac.webkit.org/changeset/66359

Adam Barth

Comment 11 2010-08-29 23:55:59 PDT

(In reply to comment #10) > http://trac.webkit.org/changeset/66359 might have broken Qt Linux Release minimal, Qt Linux ARMv5 Release, and Qt Linux ARMv7 Release Thanks Darin. I was working on a build fix, but yours is much nice.

Note You need to log in before you can comment on or make changes to this bug.