Bug 44790

Summary: Move UTF16 LEAD/TRAIL logic into the HTMLEntityParser
Product: WebKit Reporter: Adam Barth <abarth>
Component: New BugsAssignee: Adam Barth <abarth>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, commit-queue, eric, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Other   
OS: OS X 10.5   
Attachments:
Description Flags
Patch
none
Patch for landing none

Adam Barth
Reported 2010-08-27 13:12:52 PDT
Move UTF16 LEAD/TRAIL logic into the HTMLEntityParser
Attachments
Patch (13.73 KB, patch)
2010-08-27 13:14 PDT, Adam Barth
no flags
Patch for landing (14.52 KB, patch)
2010-08-29 23:12 PDT, Adam Barth
no flags
Adam Barth
Comment 1 2010-08-27 13:14:55 PDT
Eric Seidel (no email)
Comment 2 2010-08-27 14:34:34 PDT
Comment on attachment 65756 [details] Patch WebCore/html/HTMLEntityParser.cpp:65 + inline bool convertToCharacters(unsigned value, Vector<UChar>& decodedEntity) I'm confused why this returns a bool if it's always true. WebCore/html/HTMLEntityParser.h:34 + bool consumeHTMLEntity(SegmentedString&, Vector<UChar>& decodedEntity, bool& notEnoughCharacters, UChar additionalAllowedCharacter = '\0'); This is always at most 2 chars. Vector seems like overkill. WebCore/html/HTMLTokenizer.cpp:122 + Vector<UChar> decodedEntity; Inline capacity 2? Using a vector here feels strange. It makes sense in your possible future world if many-char entities, but not much for when we at most produce 2 utf16 code points.
Adam Barth
Comment 3 2010-08-27 14:46:46 PDT
> I'm confused why this returns a bool if it's always true. Just to make the callsites look snazzy. We always want to return true after calling this function. > This is always at most 2 chars. Vector seems like overkill. Secretly I want to prepare for entities that expand to more than two characters. > WebCore/html/HTMLTokenizer.cpp:122 > + Vector<UChar> decodedEntity; > Inline capacity 2? Sure. > Using a vector here feels strange. It makes sense in your possible future world if many-char entities, but not much for when we at most produce 2 utf16 code points. Would you prefer UChar decodedEntity[2] ? Generally, I like Vector because it makes sure we have memory safety.
Darin Adler
Comment 4 2010-08-28 16:10:54 PDT
Comment on attachment 65756 [details] Patch > + * html/HTMLEntityParser.cpp: > + (WebCore::): > + (WebCore::consumeHTMLEntity): This list of function names is a bit strange. As you know, I am a big fan of per-function change log comments, but just removing the bogus function name might be enough. For some reason the new function's name is not here either. > +inline bool convertToCharacters(unsigned value, Vector<UChar>& decodedEntity) Functions defined and used within the same file should be marked static so they get internal linkage, unless this is inside an unnamed namespace. We have been using static rather than unnamed namespaces because tool support for unnamed namespaces was lacking, but perhaps you decided on the other path for this code? The argument name "value" here is strange. We have a type UChar32 that we often use instead of unsigned for such things, by the way. The name convertToCharacters is a bit unclear since this converts a Unicode character into a sequence of UTF-16 code units. It actually converts *from* a character. I think you should use a different name. > + if (value < 0xFFFF) { Why < here instead of <=? The old code used "< 0xFFFF" in one place and "> 0xFFFF" in another, which is inconsistent. ICU has the macro U_IS_BMP for this operation, which I think is slightly more explicit than doing the math directly. > + Vector<UChar> decodedEntity; > + bool success = consumeHTMLEntity(source, decodedEntity, notEnoughCharacters); I don't know how hot this code path is, but if it's hot enough, we might want to use Vector<UChar, 16> for this instead of Vector<UChar> to avoid a memory allocation each time. r=me
Adam Barth
Comment 5 2010-08-28 22:27:47 PDT
Thanks for the comments. > > +inline bool convertToCharacters(unsigned value, Vector<UChar>& decodedEntity) > > Functions defined and used within the same file should be marked static so they get internal linkage, unless this is inside an unnamed namespace. We have been using static rather than unnamed namespaces because tool support for unnamed namespaces was lacking, but perhaps you decided on the other path for this code? Ah, I was under the impression that inline implies static. I thought that's what one of the compiler gurus told me, but http://gcc.gnu.org/onlinedocs/gcc/Inline.html seems to imply that static is not redundant. In this case, the code is inside an anonymous namespace. I didn't know there were tooling issues with anonymous namespaces. Maybe those tooling issues are out of date since we use anonymous namespaces in a bunch of places?
Adam Barth
Comment 6 2010-08-29 23:11:12 PDT
Ok. I think I've addressed all your comments. Let's see if the bot likes the patch.
Adam Barth
Comment 7 2010-08-29 23:12:31 PDT
Created attachment 65879 [details] Patch for landing
WebKit Commit Bot
Comment 8 2010-08-29 23:31:36 PDT
Comment on attachment 65879 [details] Patch for landing Clearing flags on attachment: 65879 Committed r66359: <http://trac.webkit.org/changeset/66359>
WebKit Commit Bot
Comment 9 2010-08-29 23:31:41 PDT
All reviewed patches have been landed. Closing bug.
WebKit Review Bot
Comment 10 2010-08-29 23:37:12 PDT
http://trac.webkit.org/changeset/66359 might have broken Qt Linux Release minimal, Qt Linux ARMv5 Release, and Qt Linux ARMv7 Release The following changes are on the blame list: http://trac.webkit.org/changeset/66360 http://trac.webkit.org/changeset/66359
Adam Barth
Comment 11 2010-08-29 23:55:59 PDT
(In reply to comment #10) > http://trac.webkit.org/changeset/66359 might have broken Qt Linux Release minimal, Qt Linux ARMv5 Release, and Qt Linux ARMv7 Release Thanks Darin. I was working on a build fix, but yours is much nice.
Note You need to log in before you can comment on or make changes to this bug.