Bug 44790

Summary: Move UTF16 LEAD/TRAIL logic into the HTMLEntityParser
Product: WebKit Reporter: Adam Barth <abarth>
Component: New BugsAssignee: Adam Barth <abarth>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, commit-queue, eric, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Other   
OS: OS X 10.5   
Attachments:
Description Flags
Patch
none
Patch for landing none

Description Adam Barth 2010-08-27 13:12:52 PDT
Move UTF16 LEAD/TRAIL logic into the HTMLEntityParser
Comment 1 Adam Barth 2010-08-27 13:14:55 PDT
Created attachment 65756 [details]
Patch
Comment 2 Eric Seidel (no email) 2010-08-27 14:34:34 PDT
Comment on attachment 65756 [details]
Patch

WebCore/html/HTMLEntityParser.cpp:65
 +  inline bool convertToCharacters(unsigned value, Vector<UChar>& decodedEntity)
I'm confused why this returns a bool if it's always true.

WebCore/html/HTMLEntityParser.h:34
 +  bool consumeHTMLEntity(SegmentedString&, Vector<UChar>& decodedEntity, bool& notEnoughCharacters, UChar additionalAllowedCharacter = '\0');
This is always at most 2 chars.  Vector seems like overkill.

WebCore/html/HTMLTokenizer.cpp:122
 +      Vector<UChar> decodedEntity;
Inline capacity 2?

Using a vector here feels strange.  It makes sense in your possible future world if many-char entities, but not much for when we at most produce 2 utf16 code points.
Comment 3 Adam Barth 2010-08-27 14:46:46 PDT
> I'm confused why this returns a bool if it's always true.

Just to make the callsites look snazzy.  We always want to return true after calling this function.

> This is always at most 2 chars.  Vector seems like overkill.

Secretly I want to prepare for entities that expand to more than two characters.

> WebCore/html/HTMLTokenizer.cpp:122
>  +      Vector<UChar> decodedEntity;
> Inline capacity 2?

Sure.

> Using a vector here feels strange.  It makes sense in your possible future world if many-char entities, but not much for when we at most produce 2 utf16 code points.

Would you prefer UChar decodedEntity[2] ?  Generally, I like Vector because it makes sure we have memory safety.
Comment 4 Darin Adler 2010-08-28 16:10:54 PDT
Comment on attachment 65756 [details]
Patch

> +        * html/HTMLEntityParser.cpp:
> +        (WebCore::):
> +        (WebCore::consumeHTMLEntity):

This list of function names is a bit strange. As you know, I am a big fan of per-function change log comments, but just removing the bogus function name might be enough. For some reason the new function's name is not here either.

> +inline bool convertToCharacters(unsigned value, Vector<UChar>& decodedEntity)

Functions defined and used within the same file should be marked static so they get internal linkage, unless this is inside an unnamed namespace. We have been using static rather than unnamed namespaces because tool support for unnamed namespaces was lacking, but perhaps you decided on the other path for this code?

The argument name "value" here is strange. We have a type UChar32 that we often use instead of unsigned for such things, by the way.

The name convertToCharacters is a bit unclear since this converts a Unicode character into a sequence of UTF-16 code units. It actually converts *from* a character. I think you should use a different name.

> +    if (value < 0xFFFF) {

Why < here instead of <=? The old code used "< 0xFFFF" in one place and "> 0xFFFF" in another, which is inconsistent.

ICU has the macro U_IS_BMP for this operation, which I think is slightly more explicit than doing the math directly.

> +    Vector<UChar> decodedEntity;
> +    bool success = consumeHTMLEntity(source, decodedEntity, notEnoughCharacters);

I don't know how hot this code path is, but if it's hot enough, we might want to use Vector<UChar, 16> for this instead of Vector<UChar> to avoid a memory allocation each time.

r=me
Comment 5 Adam Barth 2010-08-28 22:27:47 PDT
Thanks for the comments.

> > +inline bool convertToCharacters(unsigned value, Vector<UChar>& decodedEntity)
> 
> Functions defined and used within the same file should be marked static so they get internal linkage, unless this is inside an unnamed namespace. We have been using static rather than unnamed namespaces because tool support for unnamed namespaces was lacking, but perhaps you decided on the other path for this code?

Ah, I was under the impression that inline implies static.  I thought that's what one of the compiler gurus told me, but http://gcc.gnu.org/onlinedocs/gcc/Inline.html seems to imply that static is not redundant.  In this case, the code is inside an anonymous namespace.  I didn't know there were tooling issues with anonymous namespaces.  Maybe those tooling issues are out of date since we use anonymous namespaces in a bunch of places?
Comment 6 Adam Barth 2010-08-29 23:11:12 PDT
Ok.  I think I've addressed all your comments.  Let's see if the bot likes the patch.
Comment 7 Adam Barth 2010-08-29 23:12:31 PDT
Created attachment 65879 [details]
Patch for landing
Comment 8 WebKit Commit Bot 2010-08-29 23:31:36 PDT
Comment on attachment 65879 [details]
Patch for landing

Clearing flags on attachment: 65879

Committed r66359: <http://trac.webkit.org/changeset/66359>
Comment 9 WebKit Commit Bot 2010-08-29 23:31:41 PDT
All reviewed patches have been landed.  Closing bug.
Comment 10 WebKit Review Bot 2010-08-29 23:37:12 PDT
http://trac.webkit.org/changeset/66359 might have broken Qt Linux Release minimal, Qt Linux ARMv5 Release, and Qt Linux ARMv7 Release
The following changes are on the blame list:
http://trac.webkit.org/changeset/66360
http://trac.webkit.org/changeset/66359
Comment 11 Adam Barth 2010-08-29 23:55:59 PDT
(In reply to comment #10)
> http://trac.webkit.org/changeset/66359 might have broken Qt Linux Release minimal, Qt Linux ARMv5 Release, and Qt Linux ARMv7 Release

Thanks Darin.  I was working on a build fix, but yours is much nice.