Bug 44831 - The logic to escape entities in appendEscapedContent and appendAttributeValue should be merged
Summary: The logic to escape entities in appendEscapedContent and appendAttributeValue...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: HTML Editing (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Enhancement
Assignee: Ryosuke Niwa
URL:
Keywords:
Depends on: 44288
Blocks:
  Show dependency treegraph
 
Reported: 2010-08-28 14:47 PDT by Ryosuke Niwa
Modified: 2010-08-28 22:11 PDT (History)
5 users (show)

See Also:


Attachments
cleanup (10.35 KB, patch)
2010-08-28 15:07 PDT, Ryosuke Niwa
no flags Details | Formatted Diff | Diff
fixed per Darin's comment (10.98 KB, patch)
2010-08-28 17:30 PDT, Ryosuke Niwa
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Ryosuke Niwa 2010-08-28 14:47:35 PDT
Currently appendEscapedContent and appendAttributeValue separately implements entity escape mechanism.  But this is a duplication of code and should be merged.
Comment 1 Ryosuke Niwa 2010-08-28 15:07:45 PDT
Created attachment 65841 [details]
cleanup
Comment 2 Darin Adler 2010-08-28 15:34:47 PDT
Comment on attachment 65841 [details]
cleanup

The uses of the word "escape" here don't make that much sense to me. I don't think we normally talk about entities as "escaped".

> +enum EntityEscapeFlag {
> +    EscapeNone  = 0x0000,
> +    EscapeAmp   = 0x0001,
> +    EscapeLt    = 0x0002,
> +    EscapeGt    = 0x0004,
> +    EscapeQuot  = 0x0008,
> +    EscapeNbsp  = 0x0010,
> +};

We normally avoid lining things up like this because if you change names it ends up unaligned later.

> +static void escapeEntities(Vector<UChar>& out, const UChar* content, size_t length, int entitiesToEscape)

The variable name "entitiesToEscape" here should have the word "mask" or something like that in it, because otherwise it seems it could be a count.

I think that escapeEntities is not a great name for a function that appends to a vector while turning certain characters into entity sequences. I would call this appendCharactersAndEntities or something like that. The key is that it's an append function, not an escape function, since it appends characters.

> +    size_t positionAfterLastEntity = 0;
> +    for (size_t i = 0; i < length; i++) {
> +        // This loop should be unfolded by compiler

What does that comment mean? Do you mean the compiler should unroll the loop? Do our compilers unroll the loop or not? I don't understand the purpose of the comment.

> +        // HashMap is an overkill here because there are only 4 entities.

This seems a bit too specific a comment. I would just leave it out.

> +        for (size_t m = 0; m < sizeof(entityMaps) / sizeof(EntityMap); m++) {

I don't think EntityMap is a good name for a single entry in this array. I would call that single one an EntityDescription or something. The overall structure is what I'd call a map, not a single entry.

> +                out.append('&');
> +                append(out, entityMaps[m].name);
> +                out.append(';');

I suggest just including the "&" and the ";" in the string inside the map. The cost would be low and we'd end up with easier to understand code and faster logic since we would not have three separate append calls:

> +    escapeEntities(result, attribute.characters(), attribute.length(),
> +                   EscapeAmp | EscapeLt | EscapeGt | EscapeQuot | (escapeNBSP ? EscapeNbsp : 0));

Normally we don't line up second lines like this. Instead we just indent four spaces. If the escapeEntities function was renamed the alignment would be screwed up. We don't like that kind of high-maintenance formatting.

> +        appendNodeValue(out, text, m_range, EscapeAmp | EscapeLt | EscapeGt
> +                        | (text->document()->isHTMLDocument() ? EscapeNbsp : 0));

Same comment about lining things up.

> +    String content = useRenderedText ? renderedText(text, m_range) : stringValueForRange(text, m_range);
> +    Vector<UChar> buffer;
> +    escapeEntities(buffer, content.characters(), content.length(), EscapeAmp | EscapeLt | EscapeGt);
> +    String markup = convertHTMLTextToInterchangeFormat(String::adopt(buffer), text);
>      append(out, markup);

I think this would read better if you eliminated the local variable named markup.

We might want constants for the combinations of entities. Those could help document why they are the right set of characters to encode. For example, the minimal set for text outside of an tag is &amp; and &lt; but then we include &gt; just because we think it's inelegant to escape one and not the other. Then we escape quote marks to make things safe for inside an attribute value. Not sure why we escape non-breaking spaces at all.

Using constants for the masks can help document all of this.

I'm going to say r=me on this patch but I think you may want to do quite a bit more refinement on this before landing it, so you may want to post a new patch.
Comment 3 Ryosuke Niwa 2010-08-28 17:30:44 PDT
Created attachment 65842 [details]
fixed per Darin's comment
Comment 4 Ryosuke Niwa 2010-08-28 17:40:08 PDT
Thanks for the review, Darin.

(In reply to comment #2)
> (From update of attachment 65841 [details])
> The uses of the word "escape" here don't make that much sense to me. I don't think we normally talk about entities as "escaped".

I guess. Renamed Escape* to Entity*, and added EntityMaskInCDATA, EntityMaskInPCDATA, EntityMaskInHTMLPCDATA, EntityMaskInAttributeValue, and EntityMaskInHTMLAttributeValue to define masks.

> > +enum EntityEscapeFlag {
> > +    EscapeNone  = 0x0000,
> > +    EscapeAmp   = 0x0001,
> > +    EscapeLt    = 0x0002,
> > +    EscapeGt    = 0x0004,
> > +    EscapeQuot  = 0x0008,
> > +    EscapeNbsp  = 0x0010,
> > +};
> 
> We normally avoid lining things up like this because if you change names it ends up unaligned later.

Fixed.

> > +static void escapeEntities(Vector<UChar>& out, const UChar* content, size_t length, int entitiesToEscape)
> 
> The variable name "entitiesToEscape" here should have the word "mask" or something like that in it, because otherwise it seems it could be a count.

Renamed to entityMask.

> I think that escapeEntities is not a great name for a function that appends to a vector while turning certain characters into entity sequences. I would call this appendCharactersAndEntities or something like that. The key is that it's an append function, not an escape function, since it appends characters.

I agree. Renamed to appendCharactersReplacingEntities.

> > +    size_t positionAfterLastEntity = 0;
> > +    for (size_t i = 0; i < length; i++) {
> > +        // This loop should be unfolded by compiler
> 
> What does that comment mean? Do you mean the compiler should unroll the loop? Do our compilers unroll the loop or not? I don't understand the purpose of the comment.

Removed the comment since it seemed rather redundant.

> > +        // HashMap is an overkill here because there are only 4 entities.
> 
> This seems a bit too specific a comment. I would just leave it out.

Sure, removed.

> > +        for (size_t m = 0; m < sizeof(entityMaps) / sizeof(EntityMap); m++) {
> 
> I don't think EntityMap is a good name for a single entry in this array. I would call that single one an EntityDescription or something. The overall structure is what I'd call a map, not a single entry.

Good point, renamed to EntityDescription.

> > +                out.append('&');
> > +                append(out, entityMaps[m].name);
> > +                out.append(';');
> 
> I suggest just including the "&" and the ";" in the string inside the map. The cost would be low and we'd end up with easier to understand code and faster logic since we would not have three separate append calls:

Yeah, I was considering that as well but I thought having to put & and ; at the beginning and at the end of each name is error prone.  But changed to the one you suggested since there are exactly four entities after all.  We can change back to the other way if we decide to add a lot more entities later.

> > +    escapeEntities(result, attribute.characters(), attribute.length(),
> > +                   EscapeAmp | EscapeLt | EscapeGt | EscapeQuot | (escapeNBSP ? EscapeNbsp : 0));
> 
> Normally we don't line up second lines like this. Instead we just indent four spaces. If the escapeEntities function was renamed the alignment would be screwed up. We don't like that kind of high-maintenance formatting.

Ah, XCode screwed me up on this when I copy & pasted.  Fixed.

> > +    String content = useRenderedText ? renderedText(text, m_range) : stringValueForRange(text, m_range);
> > +    Vector<UChar> buffer;
> > +    escapeEntities(buffer, content.characters(), content.length(), EscapeAmp | EscapeLt | EscapeGt);
> > +    String markup = convertHTMLTextToInterchangeFormat(String::adopt(buffer), text);
> >      append(out, markup);
> 
> I think this would read better if you eliminated the local variable named markup.

Very good point.  Fixed.

> We might want constants for the combinations of entities. Those could help document why they are the right set of characters to encode. For example, the minimal set for text outside of an tag is &amp; and &lt; but then we include &gt; just because we think it's inelegant to escape one and not the other. Then we escape quote marks to make things safe for inside an attribute value. Not sure why we escape non-breaking spaces at all.

Yeah, I was considering this as well but since we only have 4 entities, it's okay to have them directly.  Let me know if you like my new approach.

> I'm going to say r=me on this patch but I think you may want to do quite a bit more refinement on this before landing it, so you may want to post a new patch.

Re-submitted.
Comment 5 Darin Adler 2010-08-28 19:28:22 PDT
Comment on attachment 65842 [details]
fixed per Darin's comment

> +    EntityMaskInCDATA = EntityNone,
> +    EntityMaskInPCDATA = EntityAmp | EntityLt | EntityGt,
> +    EntityMaskInHTMLPCDATA = EntityMaskInPCDATA | EntityNbsp,
> +    EntityMaskInAttributeValue = EntityAmp | EntityLt | EntityGt | EntityQuot,
> +    EntityMaskInHTMLAttributeValue = EntityMaskInAttributeValue | EntityNbsp,

At some point I’d like to see comments on the above explaining why we escape what we do in each context. Here where we define the constants seems to be the perfect place.

r=me
Comment 6 Darin Adler 2010-08-28 19:32:40 PDT
(In reply to comment #4)
> Yeah, I was considering that as well but I thought having to put & and ; at the beginning and at the end of each name is error prone.  But changed to the one you suggested since there are exactly four entities after all.  We can change back to the other way if we decide to add a lot more entities later.

A side thought on this: To mitigate this "error prone" aspect all we need to do is to add an assertion that the first character is & and the last character is ; somewhere, but even better make sure we cover each case with a test and then we're fine.
Comment 7 Ryosuke Niwa 2010-08-28 22:06:05 PDT
Thanks for reviewing all of my patches.  I really appreciate it.

(In reply to comment #5)
> (From update of attachment 65842 [details])
> > +    EntityMaskInCDATA = EntityNone,
> > +    EntityMaskInPCDATA = EntityAmp | EntityLt | EntityGt,
> > +    EntityMaskInHTMLPCDATA = EntityMaskInPCDATA | EntityNbsp,
> > +    EntityMaskInAttributeValue = EntityAmp | EntityLt | EntityGt | EntityQuot,
> > +    EntityMaskInHTMLAttributeValue = EntityMaskInAttributeValue | EntityNbsp,
> 
> At some point I’d like to see comments on the above explaining why we escape what we do in each context. Here where we define the constants seems to be the perfect place.

Yes, that'll be great.  I might file a bug for that because I'd like to hear reasons from original authors as well.

(In reply to comment #6)
> A side thought on this: To mitigate this "error prone" aspect all we need to do is to add an assertion that the first character is & and the last character is ; somewhere, but even better make sure we cover each case with a test and then we're fine.

Ah! assertion seems to be a good idea but I'm pretty sure all the entities that are currently present (amp, lt, gt, & quot, nbsp) are tested.  So I'm not going to add assertion this time.
Comment 8 Ryosuke Niwa 2010-08-28 22:11:56 PDT
Committed r66326: <http://trac.webkit.org/changeset/66326>