Bug 44790 - Move UTF16 LEAD/TRAIL logic into the HTMLEntityParser
Summary: Move UTF16 LEAD/TRAIL logic into the HTMLEntityParser
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: Other OS X 10.5
: P2 Normal
Assignee: Adam Barth
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-27 13:12 PDT by Adam Barth
Modified: 2010-08-29 23:55 PDT (History)
4 users (show)

See Also:


Attachments
Patch (13.73 KB, patch)
2010-08-27 13:14 PDT, Adam Barth
no flags Details | Formatted Diff | Diff
Patch for landing (14.52 KB, patch)
2010-08-29 23:12 PDT, Adam Barth
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Adam Barth 2010-08-27 13:12:52 PDT
Move UTF16 LEAD/TRAIL logic into the HTMLEntityParser
Comment 1 Adam Barth 2010-08-27 13:14:55 PDT
Created attachment 65756 [details]
Patch
Comment 2 Eric Seidel (no email) 2010-08-27 14:34:34 PDT
Comment on attachment 65756 [details]
Patch

WebCore/html/HTMLEntityParser.cpp:65
 +  inline bool convertToCharacters(unsigned value, Vector<UChar>& decodedEntity)
I'm confused why this returns a bool if it's always true.

WebCore/html/HTMLEntityParser.h:34
 +  bool consumeHTMLEntity(SegmentedString&, Vector<UChar>& decodedEntity, bool& notEnoughCharacters, UChar additionalAllowedCharacter = '\0');
This is always at most 2 chars.  Vector seems like overkill.

WebCore/html/HTMLTokenizer.cpp:122
 +      Vector<UChar> decodedEntity;
Inline capacity 2?

Using a vector here feels strange.  It makes sense in your possible future world if many-char entities, but not much for when we at most produce 2 utf16 code points.
Comment 3 Adam Barth 2010-08-27 14:46:46 PDT
> I'm confused why this returns a bool if it's always true.

Just to make the callsites look snazzy.  We always want to return true after calling this function.

> This is always at most 2 chars.  Vector seems like overkill.

Secretly I want to prepare for entities that expand to more than two characters.

> WebCore/html/HTMLTokenizer.cpp:122
>  +      Vector<UChar> decodedEntity;
> Inline capacity 2?

Sure.

> Using a vector here feels strange.  It makes sense in your possible future world if many-char entities, but not much for when we at most produce 2 utf16 code points.

Would you prefer UChar decodedEntity[2] ?  Generally, I like Vector because it makes sure we have memory safety.
Comment 4 Darin Adler 2010-08-28 16:10:54 PDT
Comment on attachment 65756 [details]
Patch

> +        * html/HTMLEntityParser.cpp:
> +        (WebCore::):
> +        (WebCore::consumeHTMLEntity):

This list of function names is a bit strange. As you know, I am a big fan of per-function change log comments, but just removing the bogus function name might be enough. For some reason the new function's name is not here either.

> +inline bool convertToCharacters(unsigned value, Vector<UChar>& decodedEntity)

Functions defined and used within the same file should be marked static so they get internal linkage, unless this is inside an unnamed namespace. We have been using static rather than unnamed namespaces because tool support for unnamed namespaces was lacking, but perhaps you decided on the other path for this code?

The argument name "value" here is strange. We have a type UChar32 that we often use instead of unsigned for such things, by the way.

The name convertToCharacters is a bit unclear since this converts a Unicode character into a sequence of UTF-16 code units. It actually converts *from* a character. I think you should use a different name.

> +    if (value < 0xFFFF) {

Why < here instead of <=? The old code used "< 0xFFFF" in one place and "> 0xFFFF" in another, which is inconsistent.

ICU has the macro U_IS_BMP for this operation, which I think is slightly more explicit than doing the math directly.

> +    Vector<UChar> decodedEntity;
> +    bool success = consumeHTMLEntity(source, decodedEntity, notEnoughCharacters);

I don't know how hot this code path is, but if it's hot enough, we might want to use Vector<UChar, 16> for this instead of Vector<UChar> to avoid a memory allocation each time.

r=me
Comment 5 Adam Barth 2010-08-28 22:27:47 PDT
Thanks for the comments.

> > +inline bool convertToCharacters(unsigned value, Vector<UChar>& decodedEntity)
> 
> Functions defined and used within the same file should be marked static so they get internal linkage, unless this is inside an unnamed namespace. We have been using static rather than unnamed namespaces because tool support for unnamed namespaces was lacking, but perhaps you decided on the other path for this code?

Ah, I was under the impression that inline implies static.  I thought that's what one of the compiler gurus told me, but http://gcc.gnu.org/onlinedocs/gcc/Inline.html seems to imply that static is not redundant.  In this case, the code is inside an anonymous namespace.  I didn't know there were tooling issues with anonymous namespaces.  Maybe those tooling issues are out of date since we use anonymous namespaces in a bunch of places?
Comment 6 Adam Barth 2010-08-29 23:11:12 PDT
Ok.  I think I've addressed all your comments.  Let's see if the bot likes the patch.
Comment 7 Adam Barth 2010-08-29 23:12:31 PDT
Created attachment 65879 [details]
Patch for landing
Comment 8 WebKit Commit Bot 2010-08-29 23:31:36 PDT
Comment on attachment 65879 [details]
Patch for landing

Clearing flags on attachment: 65879

Committed r66359: <http://trac.webkit.org/changeset/66359>
Comment 9 WebKit Commit Bot 2010-08-29 23:31:41 PDT
All reviewed patches have been landed.  Closing bug.
Comment 10 WebKit Review Bot 2010-08-29 23:37:12 PDT
http://trac.webkit.org/changeset/66359 might have broken Qt Linux Release minimal, Qt Linux ARMv5 Release, and Qt Linux ARMv7 Release
The following changes are on the blame list:
http://trac.webkit.org/changeset/66360
http://trac.webkit.org/changeset/66359
Comment 11 Adam Barth 2010-08-29 23:55:59 PDT
(In reply to comment #10)
> http://trac.webkit.org/changeset/66359 might have broken Qt Linux Release minimal, Qt Linux ARMv5 Release, and Qt Linux ARMv7 Release

Thanks Darin.  I was working on a build fix, but yours is much nice.